Word Embeddings: Training the Model
Table of Contents
Building and Training the Model
In the previous post we did some preliminary set up and data pre-processing. Now we're going to build and train a Continuous Bag of Words (CBOW) model.
Imports
# python
from argparse import Namespace
from collections import Counter
from enum import Enum, unique
from functools import partial
import math
import random
# pypi
from expects import be_true, contain_exactly, equal, expect
import holoviews
import hvplot.pandas
import numpy
import pandas
# this project
from neurotic.nlp.word_embeddings import DataCleaner, MetaData
# my other stuff
from graeae import EmbedHoloviews, Timer
Set Up
Code from the previous post.
cleaner = DataCleaner()
data = cleaner.processed
meta = MetaData(data)
TIMER = Timer(speak=False)
Embed = partial(EmbedHoloviews, folder_path="files/posts/nlp/word-embeddings-training-the-model")
Plot = Namespace(
width=990,
height=780,
fontscale=2,
tan="#ddb377",
blue="#4687b7",
red="#ce7b6d",
)
Something to help remember what the numpy axis argument is.
@unique
class Axis(Enum):
ROWS = 0
COLUMNS = 1
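Here's a quick check (mine, not part of the original assignment) of what those values mean: summing with axis=ROWS collapses the rows, leaving one total per column, while axis=COLUMNS collapses the columns, leaving one total per row.
grid = numpy.array([[1, 2, 3],
                    [4, 5, 6]])
# collapsing the rows leaves one sum per column
expect(numpy.sum(grid, axis=Axis.ROWS.value).tolist()).to(equal([5, 7, 9]))
# collapsing the columns leaves one sum per row
expect(numpy.sum(grid, axis=Axis.COLUMNS.value).tolist()).to(equal([6, 15]))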
Middle
Initializing the model
You will now initialize two matrices and two vectors.
- The first matrix (\(W_1\)) is of dimension \(N \times V\), where V is the number of words in your vocabulary and N is the dimension of your word vector.
- The second matrix (\(W_2\)) is of dimension \(V \times N\).
- Vector \(b_1\) has dimensions \(N\times 1\)
- Vector \(b_2\) has dimensions \(V\times 1\).
- \(b_1\) and \(b_2\) are the bias vectors of the linear layers from matrices \(W_1\) and \(W_2\).
At this stage we are just initializing the parameters.
Please use numpy.random.rand to generate matrices that are initialized with random values from a uniform distribution, ranging between 0 and 1.
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: initialize_model
def initialize_model(N: int,V: int, random_seed: int=1) -> tuple:
"""Initialize the matrices with random values
Args:
N: dimension of hidden vector
V: dimension of vocabulary
random_seed: random seed for consistent results in the unit tests
Returns:
W1, W2, b1, b2: initialized weights and biases
"""
numpy.random.seed(random_seed)
### START CODE HERE (Replace instances of 'None' with your code) ###
# W1 has shape (N,V)
W1 = numpy.random.rand(N, V)
# W2 has shape (V,N)
W2 = numpy.random.rand(V, N)
# b1 has shape (N,1)
b1 = numpy.random.rand(N, 1)
# b2 has shape (V,1)
b2 = numpy.random.rand(V, 1)
### END CODE HERE ###
return W1, W2, b1, b2
Test your function example.
tmp_N = 4
tmp_V = 10
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
expect(tmp_W1.shape).to(equal((tmp_N,tmp_V)))
expect(tmp_W2.shape).to(equal((tmp_V,tmp_N)))
expect(tmp_b1.shape).to(equal((tmp_N, 1)))
expect(tmp_b2.shape).to(equal((tmp_V, 1)))
print(f"tmp_W1.shape: {tmp_W1.shape}")
print(f"tmp_W2.shape: {tmp_W2.shape}")
print(f"tmp_b1.shape: {tmp_b1.shape}")
print(f"tmp_b2.shape: {tmp_b2.shape}")
tmp_W1.shape: (4, 10)
tmp_W2.shape: (10, 4)
tmp_b1.shape: (4, 1)
tmp_b2.shape: (10, 1)
Softmax
Before we can start training the model, we need to implement the softmax function as defined in equation 5:
\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=0}^{V-1} e^{z_j}} \tag{5} \]
- Array indexing in code starts at 0.
- V is the number of words in the vocabulary (which is also the number of rows of z).
- The indices i and j go from 0 to V - 1.
The Implementation
- Assume that the input z to softmax is a 2D array.
- Each training example is represented by a column of shape (V, 1) in this 2D array.
- There may be more than one column in the 2D array, because you can pass in a batch of examples to increase efficiency. Let's call the batch size lowercase m, so the z array has shape (V, m).
- When taking the sum in the denominator, take it for each column (each example) separately.
Please use numpy.exp and numpy.sum in your implementation.
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: softmax
def softmax(z: numpy.ndarray) -> numpy.ndarray:
"""Calculate the softmax
Args:
z: output scores from the hidden layer
Returns:
yhat: prediction (estimate of y)
"""
### START CODE HERE (Replace instances of 'None' with your own code) ###
# Calculate yhat (softmax)
yhat = numpy.exp(z)/numpy.sum(numpy.exp(z), axis=Axis.ROWS.value)
### END CODE HERE ###
return yhat
# Test the function
tmp = numpy.array([[1,2,3],
[1,1,1]
])
tmp_sm = softmax(tmp)
print(tmp_sm)
expected = numpy.array([[0.5, 0.73105858, 0.88079708],
[0.5, 0.26894142, 0.11920292]])
expect(numpy.allclose(tmp_sm, expected)).to(be_true)
[[0.5        0.73105858 0.88079708]
 [0.5        0.26894142 0.11920292]]
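Since each column of the input is a separate example, each column of the softmax output should sum to one. A quick sanity check of that property (my addition, not part of the graded cell):
column_totals = numpy.sum(tmp_sm, axis=Axis.ROWS.value)
expect(numpy.allclose(column_totals, numpy.ones(tmp_sm.shape[1]))).to(be_true)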
Forward propagation
We're going to implement forward propagation according to equations (1) to (3):
\begin{align}
h &= W_1 X + b_1 \tag{1} \\
a &= \mathrm{ReLU}(h) \tag{2} \\
z &= W_2 a + b_2 \tag{3}
\end{align}
For the activation you will use the Rectified Linear Unit (ReLU), given by:
\[ f(h)=\max (0,h) \tag{6} \]
Hints:
- You can use numpy.maximum(x1,x2) to get the maximum of two values
- Use numpy.dot(A,B) to matrix multiply A and B
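As a tiny illustration of the first hint (my example, not from the assignment): taking the maximum against the scalar 0 zeroes out the negative entries, which is exactly the ReLU behavior.
# numpy.maximum broadcasts the scalar 0 across the array
expect(numpy.maximum(numpy.array([-1.5, 0.0, 2.0]), 0).tolist()).to(
    equal([0.0, 0.0, 2.0]))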
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: forward_prop
def forward_prop(x: numpy.ndarray,
W1: numpy.ndarray, W2: numpy.ndarray,
b1: numpy.ndarray, b2: numpy.ndarray) -> tuple:
"""Pass the data through the network
Args:
x: average one hot vector for the context
W1, W2, b1, b2: matrices and biases to be learned
Returns:
z: output score vector
"""
### START CODE HERE (Replace instances of 'None' with your own code) ###
# Calculate h
h = numpy.dot(W1, x) + b1
# Apply the relu on h (store result in h)
h = numpy.maximum(h, 0)
# Calculate z
z = numpy.dot(W2, h) + b2
### END CODE HERE ###
return z, h
Test the function
tmp_N = 2
tmp_V = 3
tmp_x = numpy.array([[0,1,0]]).T
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(N=tmp_N,V=tmp_V, random_seed=1)
print(f"x has shape {tmp_x.shape}")
print(f"N is {tmp_N} and vocabulary size V is {tmp_V}")
tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)
print("call forward_prop")
print()
print(f"z has shape {tmp_z.shape}")
print("z has values:")
print(tmp_z)
print()
print(f"h has shape {tmp_h.shape}")
print("h has values:")
print(tmp_h)
expect(tmp_x.shape).to(equal((3, 1)))
expect(tmp_z.shape).to(equal((3, 1)))
expected = numpy.array(
[[0.55379268],
[1.58960774],
[1.50722933]]
)
expect(numpy.allclose(tmp_z, expected)).to(be_true)
expect(tmp_h.shape).to(equal((2, 1)))
expected = numpy.array(
[[0.92477674],
[1.02487333]]
)
expect(numpy.allclose(tmp_h, expected)).to(be_true)
x has shape (3, 1)
N is 2 and vocabulary size V is 3
call forward_prop

z has shape (3, 1)
z has values:
[[0.55379268]
 [1.58960774]
 [1.50722933]]

h has shape (2, 1)
h has values:
[[0.92477674]
 [1.02487333]]
Pack Index with Frequency
def index_with_frequency(context_words: list,
word_to_index: dict) -> list:
"""combines indexes and frequency counts-dict
Args:
context_words: words to get the indices for
word_to_index: mapping of word to index
Returns:
list of (word-index, word-count) tuples built from context_words
"""
frequency_dict = Counter(context_words)
indices = [word_to_index[word] for word in context_words]
packed = []
for index in range(len(indices)):
word_index = indices[index]
frequency = frequency_dict[context_words[index]]
packed.append((word_index, frequency))
return packed
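A small example of what index_with_frequency produces, using a made-up context and word-to-index mapping (the names are mine, not from the dataset): each word gets paired with its total count in the context, so a repeated word shows up more than once with the same count.
toy_word_to_index = {"the": 0, "cat": 1, "sat": 2}
toy_context = ["the", "cat", "the", "sat"]
expect(index_with_frequency(toy_context, toy_word_to_index)).to(
    equal([(0, 2), (1, 1), (0, 2), (2, 1)]))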
Vector Generator
def vectors(data: numpy.ndarray, word_to_index: dict, half_window: int):
"""Generates vectors of fraction of context words each word represents
Args:
data: source of the vectors
word_to_index: mapping of word to index in the vocabulary
half_window: number of tokens on either side of the word to keep
Yields:
tuple of x, y
"""
location = half_window
vocabulary_size = len(word_to_index)
while True:
y = numpy.zeros(vocabulary_size)
x = numpy.zeros(vocabulary_size)
center_word = data[location]
y[word_to_index[center_word]] = 1
context_words = (data[(location - half_window): location]
+ data[(location + 1) : (location + half_window + 1)])
for word_index, frequency in index_with_frequency(context_words, word_to_index):
x[word_index] = frequency/len(context_words)
yield x, y
location += 1
if location >= len(data):
print("location in data is being set to 0")
location = 0
return
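To make x and y concrete, here's a sketch on a tiny made-up token list (mine, not the Shakespeare data): y is one-hot at the center word and x spreads the context counts so that it sums to one.
toy_tokens = "i like to learn word embeddings and i like nlp".split()
toy_word_to_index = {word: index
                     for index, word in enumerate(sorted(set(toy_tokens)))}
toy_x, toy_y = next(vectors(toy_tokens, toy_word_to_index, half_window=2))
# the first center word is toy_tokens[2] ("to")
expect(int(toy_y[toy_word_to_index["to"]])).to(equal(1))
expect(int(toy_y.sum())).to(equal(1))
# the context fractions sum to one
expect(math.isclose(toy_x.sum(), 1)).to(be_true)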
Batch Generator
This uses a not-so-common form of the while loop: whenever a loop runs to completion (meaning you didn't break out of it), the else clause runs.
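A minimal demonstration of that behavior (my own, not from the post's original code): the else clause runs when the loop condition becomes false, but is skipped when you break out.
counter = 0
while counter < 3:
    counter += 1
else:
    # runs because the loop ended without a break
    loop_completed = True
expect(loop_completed).to(be_true)

broke_out = False
while True:
    broke_out = True
    break
else:
    # skipped because of the break
    broke_out = False
expect(broke_out).to(be_true)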
def batch_generator(data: numpy.ndarray, word_to_index: dict,
half_window: int, batch_size: int, original: bool=True):
"""Generate batches of vectors
Args:
data: the training data
word_to_index: map of word to vocabulary index
half_window: number of tokens to take from either side of word
batch_size: Number of vectors to put in each training batch
original: run the original buggy code
Yields:
tuple of X, Y batches
"""
vocabulary_size = len(word_to_index)
batch_x = []
batch_y = []
for x, y in vectors(data,
word_to_index,
half_window):
if original:
while len(batch_x) < batch_size:
batch_x.append(x)
batch_y.append(y)
else:
yield numpy.array(batch_x).T, numpy.array(batch_y).T
else:
if len(batch_x) < batch_size:
batch_x.append(x)
batch_y.append(y)
else:
yield numpy.array(batch_x).T, numpy.array(batch_y).T
batch_x = []
batch_y = []
return
So every time batch_x reaches the batch_size it yields the tuple and then creates a new batch before continuing the outer for-loop.
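To see the shapes it produces, here's a sketch with toy data (mine), passing original=False to skip the branch the docstring flags as buggy: each batch is a pair of arrays with one row per vocabulary word and one column per example.
toy_tokens = ("the quick brown fox jumps over the lazy dog "
              "because the dog was asleep").split()
toy_word_to_index = {word: index
                     for index, word in enumerate(sorted(set(toy_tokens)))}
toy_x_batch, toy_y_batch = next(
    batch_generator(toy_tokens, toy_word_to_index,
                    half_window=2, batch_size=3, original=False))
expect(toy_x_batch.shape).to(equal((len(toy_word_to_index), 3)))
expect(toy_y_batch.shape).to(equal((len(toy_word_to_index), 3)))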
Cost function
The cross-entropy loss function.
def compute_cost(y: numpy.ndarray, y_hat: numpy.ndarray,
batch_size: int) -> numpy.ndarray:
"""Calculates the cross-entropy loss
Args:
y: array with the actual words labeled
y_hat: our model's guesses for the words
batch_size: the number of examples per training run
"""
log_probabilities = (numpy.multiply(numpy.log(y_hat), y)
+ numpy.multiply(numpy.log(1 - y_hat), 1 - y))
cost = -numpy.sum(log_probabilities)/batch_size
cost = numpy.squeeze(cost)
return cost
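As a quick numeric sanity check (toy values of mine, not from the assignment): a prediction that puts nearly all of its probability on the correct word gives a cost close to zero.
toy_y = numpy.array([[1.0], [0.0], [0.0]])
toy_yhat = numpy.array([[0.98], [0.01], [0.01]])
# a nearly perfect one-column prediction should cost almost nothing
expect(bool(compute_cost(toy_y, toy_yhat, batch_size=1) < 0.05)).to(be_true)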
Test the function
tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
tmp_word2Ind, tmp_Ind2word = meta.word_to_index, meta.vocabulary
tmp_V = len(meta.vocabulary)
tmp_x, tmp_y = next(batch_generator(data, tmp_word2Ind, tmp_C, tmp_batch_size))
print(f"tmp_x.shape {tmp_x.shape}")
print(f"tmp_y.shape {tmp_y.shape}")
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
print(f"tmp_W1.shape {tmp_W1.shape}")
print(f"tmp_W2.shape {tmp_W2.shape}")
print(f"tmp_b1.shape {tmp_b1.shape}")
print(f"tmp_b2.shape {tmp_b2.shape}")
tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)
print(f"tmp_z.shape: {tmp_z.shape}")
print(f"tmp_h.shape: {tmp_h.shape}")
tmp_yhat = softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")
tmp_cost = compute_cost(tmp_y, tmp_yhat, tmp_batch_size)
print("call compute_cost")
print(f"tmp_cost {tmp_cost:.4f}")
expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(math.isclose(tmp_cost, 9.9560, abs_tol=1e-4)).to(be_true)
tmp_x.shape (5778, 4)
tmp_y.shape (5778, 4)
tmp_W1.shape (50, 5778)
tmp_W2.shape (5778, 50)
tmp_b1.shape (50, 1)
tmp_b2.shape (5778, 1)
tmp_z.shape: (5778, 4)
tmp_h.shape: (50, 4)
tmp_yhat.shape: (5778, 4)
call compute_cost
tmp_cost 9.9560
Training the Model - Backpropagation
Now that you have understood how the CBOW model works, you will train it. You created a function for the forward propagation. Now you will implement a function that computes the gradients to backpropagate the errors.
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: back_prop
def back_prop(x: numpy.ndarray,
yhat: numpy.ndarray,
y: numpy.ndarray,
h: numpy.ndarray,
W1: numpy.ndarray,
W2: numpy.ndarray,
b1: numpy.ndarray,
b2: numpy.ndarray,
batch_size: int) -> tuple:
"""Calculates the gradients
Args:
x: average one hot vector for the context
yhat: prediction (estimate of y)
y: target vector
h: hidden vector (see eq. 1)
W1, W2, b1, b2: matrices and biases
batch_size: batch size
Returns:
grad_W1, grad_W2, grad_b1, grad_b2: gradients of matrices and biases
"""
### START CODE HERE (Replace instances of 'None' with your code) ###
# Compute l1 as W2^T (Yhat - Y)
# Re-use it whenever you see W2^T (Yhat - Y) used to compute a gradient
l1 = numpy.dot(W2.T, yhat - y)
# Apply relu to l1
l1 = numpy.maximum(l1, 0)
# Compute the gradient of W1
grad_W1 = numpy.dot(l1, x.T)/batch_size
# Compute the gradient of W2
grad_W2 = numpy.dot(yhat - y, h.T)/batch_size
# Compute the gradient of b1
grad_b1 = numpy.sum(l1, axis=Axis.COLUMNS.value, keepdims=True)/batch_size
# Compute the gradient of b2
grad_b2 = numpy.sum(yhat - y, axis=Axis.COLUMNS.value, keepdims=True)/batch_size
### END CODE HERE ###
return grad_W1, grad_W2, grad_b1, grad_b2
Test the function
tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
tmp_word2Ind, tmp_Ind2word = meta.word_to_index, meta.vocabulary
tmp_V = len(meta.vocabulary)
# get a batch of data
tmp_x, tmp_y = next(batch_generator(data, tmp_word2Ind, tmp_C, tmp_batch_size))
print("get a batch of data")
print(f"tmp_x.shape {tmp_x.shape}")
print(f"tmp_y.shape {tmp_y.shape}")
print()
print("Initialize weights and biases")
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
print(f"tmp_W1.shape {tmp_W1.shape}")
print(f"tmp_W2.shape {tmp_W2.shape}")
print(f"tmp_b1.shape {tmp_b1.shape}")
print(f"tmp_b2.shape {tmp_b2.shape}")
print()
print("Forwad prop to get z and h")
tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)
print(f"tmp_z.shape: {tmp_z.shape}")
print(f"tmp_h.shape: {tmp_h.shape}")
print()
print("Get yhat by calling softmax")
tmp_yhat = softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")
tmp_m = (2*tmp_C)
tmp_grad_W1, tmp_grad_W2, tmp_grad_b1, tmp_grad_b2 = back_prop(tmp_x, tmp_yhat, tmp_y, tmp_h, tmp_W1, tmp_W2, tmp_b1, tmp_b2, tmp_batch_size)
print()
print("call back_prop")
print(f"tmp_grad_W1.shape {tmp_grad_W1.shape}")
print(f"tmp_grad_W2.shape {tmp_grad_W2.shape}")
print(f"tmp_grad_b1.shape {tmp_grad_b1.shape}")
print(f"tmp_grad_b2.shape {tmp_grad_b2.shape}")
expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(tmp_grad_W1.shape).to(equal((50, 5778)))
expect(tmp_grad_W2.shape).to(equal((5778, 50)))
expect(tmp_grad_b1.shape).to(equal((50, 1)))
expect(tmp_grad_b2.shape).to(equal((5778, 1)))
get a batch of data
tmp_x.shape (5778, 4)
tmp_y.shape (5778, 4)

Initialize weights and biases
tmp_W1.shape (50, 5778)
tmp_W2.shape (5778, 50)
tmp_b1.shape (50, 1)
tmp_b2.shape (5778, 1)

Forward prop to get z and h
tmp_z.shape: (5778, 4)
tmp_h.shape: (50, 4)

Get yhat by calling softmax
tmp_yhat.shape: (5778, 4)

call back_prop
tmp_grad_W1.shape (50, 5778)
tmp_grad_W2.shape (5778, 50)
tmp_grad_b1.shape (50, 1)
tmp_grad_b2.shape (5778, 1)
Gradient Descent
Now that you have implemented a function to compute the gradients, you will implement batch gradient descent over your training set.
Hint: for that, you will use the initialize_model, back_prop, and compute_cost functions you just created, along with the batch_generator helper defined above (the original assignment calls it get_batches).
Also: print the cost after each batch is processed (use batch size = 128).
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: gradient_descent
def gradient_descent(data: numpy.ndarray, word2Ind: dict, N: int, V: int ,
num_iters: int, alpha=0.03):
"""
This is the gradient_descent function
Args:
data: text
word2Ind: words to Indices
N: dimension of hidden vector
V: dimension of vocabulary
     num_iters: number of iterations
     alpha: the learning rate
Returns:
W1, W2, b1, b2: updated matrices and biases
"""
W1, W2, b1, b2 = initialize_model(N,V, random_seed=282)
batch_size = 128
iters = 0
C = 2
for x, y in batch_generator(data, word2Ind, C, batch_size):
### START CODE HERE (Replace instances of 'None' with your own code) ###
# Get z and h
z, h = forward_prop(x, W1, W2, b1, b2)
# Get yhat
yhat = softmax(z)
# Get cost
cost = compute_cost(y, yhat, batch_size)
if ( (iters+1) % 10 == 0):
print(f"iters: {iters + 1} cost: {cost:.6f}")
# Get gradients
grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(x,
yhat,
y,
h,
W1,
W2,
b1,
b2,
batch_size)
# Update weights and biases
W1 = W1 - alpha * grad_W1
W2 = W2 - alpha * grad_W2
b1 = b1 - alpha * grad_b1
b2 = b2 - alpha * grad_b2
### END CODE HERE ###
iters += 1
if iters == num_iters:
break
if iters % 100 == 0:
alpha *= 0.66
return W1, W2, b1, b2
Test Your Function
C = 2
N = 50
V = len(meta.vocabulary)
num_iters = 150
print("Call gradient_descent")
W1, W2, b1, b2 = gradient_descent(data, meta.word_to_index, N, V, num_iters)
Call gradient_descent
iters: 10 cost: 0.789141
iters: 20 cost: 0.105543
iters: 30 cost: 0.056008
iters: 40 cost: 0.038101
iters: 50 cost: 0.028868
iters: 60 cost: 0.023237
iters: 70 cost: 0.019444
iters: 80 cost: 0.016716
iters: 90 cost: 0.014660
iters: 100 cost: 0.013054
iters: 110 cost: 0.012133
iters: 120 cost: 0.011370
iters: 130 cost: 0.010698
iters: 140 cost: 0.010100
iters: 150 cost: 0.009566
End
The next post is one on extracting and visualizing the embeddings using Principal Component Analysis.
Bundling It Up
Imports
# python
from collections import Counter, namedtuple
from enum import Enum, unique
# pypi
import attr
import numpy
Enum Setup
@unique
class Axis(Enum):
ROWS = 0
COLUMNS = 1
Named Tuples
Gradients = namedtuple("Gradients", ["input_weights", "hidden_weights", "input_bias", "hidden_bias"])
Weights = namedtuple("Weights", ["input_weights", "hidden_weights", "input_bias", "hidden_bias"])
The CBOW Model
@attr.s(auto_attribs=True)
class CBOW:
"""A continuous bag of words model builder
Args:
hidden: number of rows in the hidden layer
vocabulary_size: number of tokens in the vocabulary
learning_rate: learning rate for back-propagation updates
random_seed: int
"""
hidden: int
vocabulary_size: int
learning_rate: float=0.03
random_seed: int=1
_random_generator: numpy.random.Generator=None
# layer one
_input_weights: numpy.ndarray=None
_input_bias: numpy.ndarray=None
# hidden layer
_hidden_weights: numpy.ndarray=None
_hidden_bias: numpy.ndarray=None
- The Random Generator
@property
def random_generator(self) -> numpy.random.Generator:
    """The random number generator"""
    if self._random_generator is None:
        self._random_generator = numpy.random.default_rng(self.random_seed)
    return self._random_generator
- First Layer Weights
These are initialized using numpy's new Generator. I originally used its standard-normal method by mistake and the model did horribly. Generator.random gives you a uniform distribution, which seems to be what you're supposed to use.
@property
def input_weights(self) -> numpy.ndarray:
    """Weights for the first layer"""
    if self._input_weights is None:
        self._input_weights = self.random_generator.random(
            (self.hidden, self.vocabulary_size))
    return self._input_weights
- First Layer Bias
@property
def input_bias(self) -> numpy.ndarray:
    """Bias for the input layer"""
    if self._input_bias is None:
        self._input_bias = self.random_generator.random(
            (self.hidden, 1)
        )
    return self._input_bias
- Hidden Layer Weights
@property
def hidden_weights(self) -> numpy.ndarray:
    """The weights for the hidden layer"""
    if self._hidden_weights is None:
        self._hidden_weights = self.random_generator.random(
            (self.vocabulary_size, self.hidden)
        )
    return self._hidden_weights
- Hidden Layer Bias
@property
def hidden_bias(self) -> numpy.ndarray:
    """Bias for the hidden layer"""
    if self._hidden_bias is None:
        self._hidden_bias = self.random_generator.random(
            (self.vocabulary_size, 1)
        )
    return self._hidden_bias
- Softmax
def softmax(self, scores: numpy.ndarray) -> numpy.ndarray:
    """Calculate the softmax

    Args:
     scores: output scores from the hidden layer

    Returns:
     yhat: prediction (estimate of y)
    """
    return numpy.exp(scores)/numpy.sum(numpy.exp(scores), axis=Axis.ROWS.value)
- Forward Propagation
def forward(self, data: numpy.ndarray) -> tuple:
    """makes a model prediction

    Args:
     data: x-values to train on

    Returns:
     output, first-layer output
    """
    first_layer_output = numpy.maximum(
        numpy.dot(self.input_weights, data) + self.input_bias, 0)
    second_layer_output = (numpy.dot(self.hidden_weights, first_layer_output)
                           + self.hidden_bias)
    return second_layer_output, first_layer_output
- Gradients
def gradients(self, data: numpy.ndarray, predicted: numpy.ndarray,
              actual: numpy.ndarray,
              hidden_input: numpy.ndarray) -> Gradients:
    """does the gradient calculation for back-propagation

    This is broken out to be able to troubleshoot/compare it

    Args:
     data: the input x value
     predicted: what our model predicted the labels for the data should be
     actual: what the actual labels should have been
     hidden_input: the input to the hidden layer

    Returns:
     Gradients for input_weights, hidden_weights, input_bias, hidden_bias
    """
    difference = predicted - actual
    batch_size = difference.shape[1]
    l1 = numpy.maximum(numpy.dot(self.hidden_weights.T, difference), 0)
    input_weights_gradient = numpy.dot(l1, data.T)/batch_size
    hidden_weights_gradient = numpy.dot(difference, hidden_input.T)/batch_size
    input_bias_gradient = numpy.sum(l1, axis=Axis.COLUMNS.value,
                                    keepdims=True)/batch_size
    hidden_bias_gradient = numpy.sum(difference, axis=Axis.COLUMNS.value,
                                     keepdims=True)/batch_size
    return Gradients(input_weights=input_weights_gradient,
                     hidden_weights=hidden_weights_gradient,
                     input_bias=input_bias_gradient,
                     hidden_bias=hidden_bias_gradient)
- Backward Propagation
def backward(self, data: numpy.ndarray, predicted: numpy.ndarray,
             actual: numpy.ndarray, hidden_input: numpy.ndarray) -> None:
    """Does back-propagation to update the weights

    Args:
     data: the input x value
     predicted: what our model predicted the labels for the data should be
     actual: what the actual labels should have been
     hidden_input: the input to the hidden layer
    """
    gradients = self.gradients(data=data, predicted=predicted,
                               actual=actual, hidden_input=hidden_input)
    # I don't have setters for the properties so use the private variables
    self._input_weights -= self.learning_rate * gradients.input_weights
    self._hidden_weights -= self.learning_rate * gradients.hidden_weights
    self._input_bias -= self.learning_rate * gradients.input_bias
    self._hidden_bias -= self.learning_rate * gradients.hidden_bias
    return
- Call
def __call__(self, data: numpy.ndarray) -> numpy.ndarray:
    """makes a prediction on the data

    Args:
     data: input data for the prediction

    Returns:
     softmax of the model output
    """
    output, _ = self.forward(data)
    return self.softmax(output)
Batch Generator
@attr.s(auto_attribs=True)
class Batches:
"""Generates batches of data
Args:
data: the source of the data to generate (training data)
word_to_index: dict mapping the word to the vocabulary index
half_window: number of tokens on either side of word to grab
batch_size: the number of entries per batch
 batches: number of batches to generate before quitting
 repetitions: count of batches generated so far (used to decide when to stop)
 verbose: whether to emit messages
"""
data: numpy.ndarray
word_to_index: dict
half_window: int
batch_size: int
batches: int
repetitions: int=0
verbose: bool=False
_vocabulary_size: int=None
_vectors: object=None
- Vocabulary Size
@property
def vocabulary_size(self) -> int:
    """Number of tokens in the vocabulary"""
    if self._vocabulary_size is None:
        self._vocabulary_size = len(self.word_to_index)
    return self._vocabulary_size
- Vectors
@property
def vectors(self):
    """our vector-generator, started up"""
    if self._vectors is None:
        self._vectors = self.vector_generator()
    return self._vectors
- Indices and Frequencies
def indices_and_frequencies(self, context_words: list) -> list:
    """combines word indices with their frequency counts

    Args:
     context_words: words to get the indices for

    Returns:
     list of (word-index, word-count) tuples built from context_words
    """
    frequencies = Counter(context_words)
    indices = [self.word_to_index[word] for word in context_words]
    return [(indices[index], frequencies[context_words[index]])
            for index in range(len(indices))]
- Vectors
def vector_generator(self):
    """Generates vectors infinitely

    x: fraction of context words represented by each word
    y: array with 1 where the center word is in the vocabulary and 0 elsewhere

    Yields:
     tuple of x, y
    """
    location = self.half_window
    while True:
        y = numpy.zeros(self.vocabulary_size)
        x = numpy.zeros(self.vocabulary_size)
        center_word = self.data[location]
        y[self.word_to_index[center_word]] = 1
        context_words = (
            self.data[(location - self.half_window): location]
            + self.data[(location + 1): (location + self.half_window + 1)])
        for word_index, frequency in self.indices_and_frequencies(context_words):
            x[word_index] = frequency/len(context_words)
        yield x, y
        location += 1
        if location >= len(self.data):
            if self.verbose:
                print("location in data is being set to 0")
            location = 0
    return
- Iterator Method
def __iter__(self):
    """makes this into an iterator"""
    return self
- Next Method
def __next__(self) -> tuple:
    """Creates the batches and returns them

    Returns:
     x, y batches
    """
    batch_x = []
    batch_y = []
    if self.repetitions == self.batches:
        raise StopIteration()
    self.repetitions += 1
    for x, y in self.vectors:
        if len(batch_x) < self.batch_size:
            batch_x.append(x)
            batch_y.append(y)
        else:
            return numpy.array(batch_x).T, numpy.array(batch_y).T
    return
The Trainer
@attr.s(auto_attribs=True)
class TheTrainer:
"""Something to train the model
Args:
model: thing to train
batches: batch generator
learning_impairment: rate to slow the model's learning
impairment_point: how frequently to impair the learner
emit_point: how frequently to emit messages
verbose: whether to emit messages
"""
model: CBOW
batches: Batches
learning_impairment: float=0.66
impairment_point: int=100
emit_point: int=10
verbose: bool=False
_losses: list=None
- Losses
@property
def losses(self) -> list:
    """Holder for the training losses"""
    if self._losses is None:
        self._losses = []
    return self._losses
- Gradient Descent
def __call__(self):
    """Trains the model using gradient descent"""
    self.best_loss = float("inf")
    for repetitions, x_y in enumerate(self.batches):
        x, y = x_y
        output, hidden_input = self.model.forward(x)
        predictions = self.model.softmax(output)
        loss = self.cross_entropy_loss(predicted=predictions, actual=y)
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_weights = Weights(
                self.model.input_weights.copy(),
                self.model.hidden_weights.copy(),
                self.model.input_bias.copy(),
                self.model.hidden_bias.copy(),
            )
        self.losses.append(loss)
        self.model.backward(data=x, predicted=predictions, actual=y,
                            hidden_input=hidden_input)
        if ((repetitions + 1) % self.impairment_point) == 0:
            self.model.learning_rate *= self.learning_impairment
            if self.verbose:
                print(f"new learning rate: {self.model.learning_rate}")
        if self.verbose and ((repetitions + 1) % self.emit_point == 0):
            print(f"{repetitions + 1}: loss={self.losses[repetitions]}")
    return
- Cross-Entropy-Loss
def cross_entropy_loss(self, predicted: numpy.ndarray,
                       actual: numpy.ndarray) -> numpy.ndarray:
    """Calculates the cross-entropy loss

    Args:
     predicted: array with the model's guesses
     actual: array with the actual labels

    Returns:
     the cross-entropy loss
    """
    log_probabilities = (numpy.multiply(numpy.log(predicted), actual)
                         + numpy.multiply(numpy.log(1 - predicted), 1 - actual))
    cost = -numpy.sum(log_probabilities)/self.batches.batch_size
    return numpy.squeeze(cost)
Testing It
from neurotic.nlp.word_embeddings import Batches, CBOW, TheTrainer
N = 4
V = len(meta.vocabulary)
model = CBOW(hidden=N, vocabulary_size=V)
expect(model.vocabulary_size).to(equal(V))
expect(model.input_weights.shape).to(equal((N, V)))
expect(model.hidden_weights.shape).to(equal((V, N)))
expect(model.input_bias.shape).to(equal((N, 1)))
expect(model.hidden_bias.shape).to(equal((V, 1)))
tmp = numpy.array([[1,2,3],
[1,1,1]
])
tmp_sm = model.softmax(tmp)
expected = numpy.array([[0.5, 0.73105858, 0.88079708],
[0.5, 0.26894142, 0.11920292]])
expect(numpy.allclose(tmp_sm, expected)).to(be_true)
Forward Propagation
tmp_N = 2
tmp_V = 3
tmp_x = numpy.array([[0,1,0]]).T
model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(N=tmp_N,V=tmp_V, random_seed=1)
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2
tmp_z, tmp_h = model.forward(tmp_x)
expect(tmp_x.shape).to(equal((3, 1)))
expect(tmp_z.shape).to(equal((3, 1)))
expect(tmp_h.shape).to(equal((2, 1)))
expected = numpy.array(
[[0.55379268],
[1.58960774],
[1.50722933]]
)
expect(numpy.allclose(tmp_z, expected)).to(be_true)
expected = numpy.array(
[[0.92477674],
[1.02487333]]
)
expect(numpy.allclose(tmp_h, expected)).to(be_true)
Cross Entropy Loss
tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
batches=15,
half_window=tmp_C, batch_size=tmp_batch_size)
tmp_V = len(meta.vocabulary)
tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2
tmp_z, tmp_h = model.forward(tmp_x)
tmp_yhat = model.softmax(tmp_z)
train = TheTrainer(model=model, batches=batches, verbose=True)
tmp_cost = train.cross_entropy_loss(actual=tmp_y, predicted=tmp_yhat)
expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(math.isclose(tmp_cost, 9.9560, abs_tol=1e-4)).to(be_true)
Back Propagation
tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
# get a batch of data
tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2
tmp_z, tmp_h = model.forward(tmp_x)
tmp_yhat = model.softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")
gradients = model.gradients(data=tmp_x, predicted=tmp_yhat, actual=tmp_y, hidden_input=tmp_h)
tmp_grad_W1, tmp_grad_W2, tmp_grad_b1, tmp_grad_b2 = back_prop(tmp_x, tmp_yhat, tmp_y, tmp_h, tmp_W1, tmp_W2, tmp_b1, tmp_b2, tmp_batch_size)
expect(numpy.allclose(gradients.input_weights, tmp_grad_W1)).to(be_true)
expect(numpy.allclose(gradients.hidden_weights, tmp_grad_W2)).to(be_true)
expect(numpy.allclose(gradients.input_bias, tmp_grad_b1)).to(be_true)
expect(numpy.allclose(gradients.hidden_bias, tmp_grad_b2)).to(be_true)
expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(tmp_grad_W1.shape).to(equal((50, 5778)))
expect(tmp_grad_W2.shape).to(equal((5778, 50)))
expect(tmp_grad_b1.shape).to(equal((50, 1)))
expect(tmp_grad_b2.shape).to(equal((5778, 1)))
Putting Some Stuff Together
tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
hidden_layers = 50
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
batches=15,
half_window=tmp_C, batch_size=tmp_batch_size)
tmp_x, tmp_y = next(batches)
model = CBOW(hidden=hidden_layers, vocabulary_size=len(meta.vocabulary))
prediction = model(tmp_x)
train = TheTrainer(model=model, batches=batches, verbose=True)
print(train.cross_entropy_loss(predicted=prediction, actual=tmp_y))
print(compute_cost(tmp_y, prediction, tmp_batch_size))
# using their initial weights
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
expect(model.input_weights.shape).to(equal(tmp_W1.shape))
expect(model.hidden_weights.shape).to(equal(tmp_W2.shape))
expect(model.input_bias.shape).to(equal(tmp_b1.shape))
expect(model.hidden_bias.shape).to(equal(tmp_b2.shape))
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2
tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
prediction = model(tmp_x)
train = TheTrainer(model=model, batches=batches, verbose=True)
print(train.cross_entropy_loss(predicted=prediction, actual=tmp_y))
print(compute_cost(tmp_y, prediction, tmp_batch_size))
11.871189103548419
11.871189103548419
9.956016099656951
9.956016099656951
I changed the weights to use the uniform distribution, which seems to work better, but weirdly it still does a little worse initially. The same random seed gives different values for the old numpy.random functions and the new Generator, so the starting weights don't match exactly.
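Here's a small check of that claim (mine, using only NumPy calls I know exist): seeding the legacy functions and the new Generator with the same value produces different numbers, even though both draw uniformly from [0, 1).
numpy.random.seed(1)
legacy_draws = numpy.random.rand(2, 2)
generator_draws = numpy.random.default_rng(1).random((2, 2))
# same seed, different streams
expect(bool(numpy.allclose(legacy_draws, generator_draws))).to(equal(False))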
The Batches
The original batch-generator had a couple of bugs in it. To avoid them, pass in original=False.
tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
batches=5,
half_window=tmp_C, batch_size=tmp_batch_size)
old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C,
tmp_batch_size, original=False)
old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)
expect(numpy.allclose(tmp_x, old_x)).to(be_true)
expect(numpy.allclose(tmp_y, old_y)).to(be_true)
old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)
#expect(numpy.allclose(tmp_x, old_x)).to(be_true)
#expect(numpy.allclose(tmp_y, old_y)).to(be_true)
old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)
Gradient Descent
hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
model = CBOW(hidden=hidden_layers, vocabulary_size=len(meta.vocabulary))
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
half_window=half_window, batch_size=batch_size, batches=repetitions)
train = TheTrainer(model=model, batches=batches, verbose=True)
train()
10: loss=12.949165499168524
20: loss=7.1739091478289225
30: loss=13.431976455238479
40: loss=4.0062314323745545
50: loss=11.595407087927406
60: loss=10.41983077447342
70: loss=7.843047289924249
80: loss=12.529314536141994
90: loss=14.122707806423126
new learning rate: 0.0198
100: loss=10.80530164111974
110: loss=4.624869443165228
120: loss=5.552813055551899
130: loss=8.483428176366933
140: loss=9.047299388851195
150: loss=4.841072955589429
Gradient Re-do
Something's wrong with the trainer's gradient descent so I'm going to try and update the original function to do it.
def grady_the_ent(model: CBOW, data: numpy.ndarray,
num_iters: int, batches: Batches, alpha=0.03):
"""This is the gradient_descent function
Args:
data: text
word2Ind: words to Indices
N: dimension of hidden vector
V: dimension of vocabulary
num_iters: number of iterations
Returns:
W1, W2, b1, b2: updated matrices and biases
"""
batch_size = 128
iters = 0
C = 2
for x, y in batches:
z, h = model.forward(x)
# Get yhat
yhat = model.softmax(z)
# Get cost
cost = compute_cost(y, yhat, batch_size)
if ((iters+1) % 10 == 0):
print(f"iters: {iters + 1} cost: {cost:.6f}")
grad_W1, grad_W2, grad_b1, grad_b2 = model.gradients(x,
yhat,
y,
h)
# Update weights and biases
model._input_weights -= alpha * grad_W1
model._hidden_weights -= alpha * grad_W2
model._input_bias -= alpha * grad_b1
model._hidden_bias -= alpha * grad_b2
iters += 1
if iters == num_iters:
break
if iters % 100 == 0:
alpha *= 0.66
return
hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)
model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
# batch_generator(data, word2Ind, C, batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
half_window=half_window, batch_size=batch_size, batches=repetitions)
grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 12.949165
iters: 20 cost: 7.173909
iters: 30 cost: 13.431976
iters: 40 cost: 4.006231
iters: 50 cost: 11.595407
iters: 60 cost: 10.419831
iters: 70 cost: 7.843047
iters: 80 cost: 12.529315
iters: 90 cost: 14.122708
iters: 100 cost: 10.805302
iters: 110 cost: 4.624869
iters: 120 cost: 5.552813
iters: 130 cost: 8.483428
iters: 140 cost: 9.047299
iters: 150 cost: 4.841073
So, something's wrong with the gradient descent.
hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)
model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = batch_generator(data, meta.word_to_index, C, batch_size)
#batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
# half_window=half_window, batch_size=batch_size, batches=repetitions)
grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 0.407862
iters: 20 cost: 0.090807
iters: 30 cost: 0.050924
iters: 40 cost: 0.035379
iters: 50 cost: 0.027105
iters: 60 cost: 0.021969
iters: 70 cost: 0.018470
iters: 80 cost: 0.015932
iters: 90 cost: 0.014008
iters: 100 cost: 0.012499
iters: 110 cost: 0.011631
iters: 120 cost: 0.010911
iters: 130 cost: 0.010274
iters: 140 cost: 0.009708
iters: 150 cost: 0.009201
It looks like it's the batches.
Troubleshooting the Batches
half_window = 2
batch_size = 128
repetitions = 150
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
half_window=half_window, batch_size=batch_size, batches=repetitions)
start = random.randint(0, 100)
context = (cleaner.processed[start: start + half_window]
           + cleaner.processed[start + half_window + 1: start + 2 * half_window + 1])
packed_1 = index_with_frequency(context, meta.word_to_index)
packed_2 = batches.indices_and_frequencies(context)
expect(packed_1).to(contain_exactly(*packed_2))
So the indices and frequencies is okay.
half_window = 2
v = vectors(cleaner.processed, meta.word_to_index, half_window)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
half_window=half_window, batch_size=batch_size, batches=repetitions)
repetition = 0
for old, new in zip(v, batches.vectors):
expect((old[0] == new[0]).all()).to(equal(True))
expect((old[1] == new[1]).all()).to(equal(True))
repetition += 1
if repetition == repetitions:
break
And the vectors look okay.
old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)
repetitions = 150
repetition = 0
# batch = next(batches)
for old in old_generator:
batch_x = []
batch_y = []
for x, y in batches.vectors:
while len(batch_x) < batches.batch_size:
batch_x.append(x)
batch_y.append(y)
else:
newx, newy = numpy.array(batch_x).T, numpy.array(batch_y).T
expect((old[0]==newx).all()).to(equal(True))
repetition += 1
if repetition == repetitions:
break
else:
continue
break
So, weirdly, rolling the __next__ by hand seems to work.
old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)
repetition, repetitions = 0, 150
for old, new in zip(old_generator, batches):
try:
expect((old[0] == new[0]).all()).to(equal(True))
expect((old[1] == new[1]).all()).to(equal(True))
except AssertionError:
print(repetition)
break
repetition += 1
if repetition == repetitions:
break
1
But not the batches.
old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)
repetition, repetitions = 0, 150
for old in old_generator:
new = next(batches)
expect(old[0].shape).to(equal(new[0].shape))
try:
expect((old[0] == new[0]).all()).to(equal(True))
expect((old[1] == new[1]).all()).to(equal(True))
except AssertionError:
print(repetition)
break
repetition += 1
if repetition == repetitions:
break
Actually, it looks like the old generator might be broken.
hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)
model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = batch_generator(data, meta.word_to_index, C, batch_size)
#batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
# half_window=half_window, batch_size=batch_size, batches=repetitions)
grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 12.949165
iters: 20 cost: 7.173909
iters: 30 cost: 13.431976
iters: 40 cost: 4.006231
iters: 50 cost: 11.595407
iters: 60 cost: 10.419831
iters: 70 cost: 7.843047
iters: 80 cost: 12.529315
iters: 90 cost: 14.122708
iters: 100 cost: 10.805302
iters: 110 cost: 4.624869
iters: 120 cost: 5.552813
iters: 130 cost: 8.483428
iters: 140 cost: 9.047299
iters: 150 cost: 4.841073
The old generator wasn't creating new lists after each yield, so it kept fitting the same batch of data over and over. In fact, it had a while loop instead of a conditional, so that one batch was just the same x and y repeated batch_size times — which really should have given the worst performance, not the suspiciously good numbers the original generator produced. I didn't re-run the cells above, but this next set is run after fixing my implementation.
hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 300
vocabulary_size = len(meta.vocabulary)
model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
half_window=half_window, batch_size=batch_size, batches=repetitions)
trainer = TheTrainer(model, batches, emit_point=50)
with TIMER:
trainer()
2020-12-16 14:15:54,530 graeae.timers.timer start: Started: 2020-12-16 14:15:54.530779
2020-12-16 14:16:18,600 graeae.timers.timer end: Ended: 2020-12-16 14:16:18.600880
2020-12-16 14:16:18,602 graeae.timers.timer end: Elapsed: 0:00:24.070101
print(trainer.losses[0], trainer.losses[-1])
11.99601105791401 8.827228045367379
Not a huge improvement, but it didn't run for a long time either.
hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 1000
vocabulary_size = len(meta.vocabulary)
model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
half_window=half_window, batch_size=batch_size, batches=repetitions)
trainer = TheTrainer(model, batches, emit_point=100, verbose=True)
with TIMER:
trainer()
2020-12-16 14:40:13,275 graeae.timers.timer start: Started: 2020-12-16 14:40:13.275964
new learning rate: 0.0198
100: loss=9.138356897918037
new learning rate: 0.013068000000000001
200: loss=9.077599951734605
new learning rate: 0.008624880000000001
300: loss=8.827228045367379
new learning rate: 0.005692420800000001
400: loss=8.556788482755191
new learning rate: 0.003756997728000001
500: loss=8.92744766914796
new learning rate: 0.002479618500480001
600: loss=9.052677036205138
new learning rate: 0.0016365482103168007
700: loss=8.914532962726918
new learning rate: 0.0010801218188090885
800: loss=8.885698480310062
new learning rate: 0.0007128804004139984
900: loss=9.042620463323736
2020-12-16 14:41:33,457 graeae.timers.timer end: Ended: 2020-12-16 14:41:33.457065
2020-12-16 14:41:33,458 graeae.timers.timer end: Elapsed: 0:01:20.181101
new learning rate: 0.000470501064273239
1000: loss=9.239992952104755
Hmm… doesn't seem to be improving.
losses = pandas.Series(trainer.losses)
line = holoviews.VLine(losses.idxmin()).opts(color=Plot.blue)
time_series = losses.hvplot().opts(title="Loss per Repetition",
width=Plot.width, height=Plot.height,
color=Plot.tan)
plot = time_series * line
output = Embed(plot=plot, file_name="training_1000")()
print(output)
Since the losses are in a Series we can use its idxmin method to see when the losses bottomed out.
print(losses.idxmin())
247
print(losses.loc[247], losses.iloc[-1])
8.186490214727549 9.239992952104755
So it did the best at 247 and then got a little worse as we went along.
print(len(meta.word_to_index)/batch_size)
45.140625
We exhausted our data after 45 batches so I guess it's overfitting after a while.