Data Generators

In Python, a generator is a function that behaves like an iterator, returning the next item each time it is asked. In many AI applications it is advantageous to have a data generator handle loading and transforming the data for different applications.
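
As a tiny illustration before the main examples (the function name count_up is made up for this post), the yield keyword is what turns an ordinary function into a generator:

def count_up(limit: int):
    """Yields 0, 1, ..., limit - 1"""
    for number in range(limit):
        yield number

print(list(count_up(3)))
[0, 1, 2]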

In the following example, we use a set of samples a to derive a new set of samples with more elements than the original set.

Note: Pay attention to the use of the list data_indices and the index variables to traverse the original list.

Imports

# python
from itertools import cycle

import random

# pypi
from expects import be_true, expect
import numpy

Examples

An Example of a Circular List

This is sort of a fake generator: a plain loop that uses an index to make the list look infinite.

a = [1, 2, 3, 4]
a_size = len(a)
end = 10
index = 0                      # similar to index in data_generator below
for i in range(end):           # end is larger than a_size, forcing a wrap
    print(a[index], end=",")
    index = (index + 1) % a_size    
1,2,3,4,1,2,3,4,1,2,

The itertools module has an equivalent to this called cycle.

index = 1
for item in cycle(a):
    print(item, end=",")
    if index == end:
        break
    index += 1    
1,2,3,4,1,2,3,4,1,2,
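
If you only want a fixed number of items from the cycle, a slightly tidier sketch uses itertools.islice instead of counting by hand (islice is part of the standard library's itertools module):

from itertools import islice

for item in islice(cycle(a), end):
    print(item, end=",")
1,2,3,4,1,2,3,4,1,2,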

And if you wanted to make your own generator version you could use the yield keyword.

def infinite(a: list):
    """Generates elements infinitely

    Args:
     a: list

    Yields:
     elements of a
    """
    index = 0
    end = len(a)
    while True:
        yield a[index]
        index = (index + 1) % end
    return

a_infinite = infinite(a)
for index, item in enumerate(a_infinite):
    if index == end:
        break
    print(item, end=",")
1,2,3,4,1,2,3,4,1,2,

Shuffling the data order

In the next example, we will do the same as before, but shuffle the order of the elements in the output list. Note that here our strategy of traversing with data_indices and data_index becomes very important, because it lets us simulate a shuffle of the input data without actually changing it.

a = tuple((1, 2, 3, 4))
a_size = len(a)
data_indices = list(range(a_size))
print(f"Original order of indices: {data_indices}")
Original order of indices: [0, 1, 2, 3]

If we shuffle data_indices we can change the order of our circular list without modifying the order of our original data.

random.shuffle(data_indices) # Shuffle the order
print(f"Shuffled order of indices: {data_indices}")
Shuffled order of indices: [3, 0, 1, 2]

Now we use the shuffled indices to build a list b of values from a and then grow it until it has more elements than a.

b = [a[index] for index in data_indices]
b_size = 10

print(f"New value order for first batch: {b}")
batch_counter = 1
data_index = 0
for b_index in range(len(b), b_size):
    if data_index == 0:
        batch_counter += 1
        random.shuffle(data_indices)
        print(f"\nShuffled Indexes for Batch No. {batch_counter} :{data_indices}")
        print(f"Values for Batch No.{batch_counter} :{[a[index] for index in data_indices]}")

    b.append(a[data_indices[data_index]])
    data_index = (data_index + 1) % a_size

print(f"\nFinal value of b: {b} with {len(b)} items")
New value order for first batch: [1, 3, 4, 2]

Shuffled Indexes for Batch No. 2 :[1, 3, 2, 0]
Values for Batch No.2 :[2, 4, 3, 1]

Shuffled Indexes for Batch No. 3 :[0, 3, 2, 1]
Values for Batch No.3 :[1, 4, 3, 2]

Final value of b: [1, 3, 4, 2, 2, 4, 3, 1, 1, 4] with 10 items

Note: We call it an epoch each time the algorithm passes over all the training examples. Shuffling the examples for each epoch is known to reduce variance, making models more general and less prone to overfitting.

Using random.sample instead.

data_indices = random.sample(range(a_size), k=a_size)
b = [a[index] for index in data_indices]
b_size = 10

print(f"New value order for first batch: {b}")
batch_counter = 1
data_index = 0
for b_index in range(len(b), b_size):
    if data_index == 0:
        batch_counter += 1
        data_indices = random.sample(data_indices, k=a_size)
        print(f"\nShuffled Indexes for Batch No. {batch_counter} :{data_indices}")
        print(f"Values for Batch No.{batch_counter} :{[a[index] for index in data_indices]}")

    b.append(a[data_indices[data_index]])
    data_index = (data_index + 1) % a_size

print(f"\nFinal value of b: {b} with {len(b)} items")
New value order for first batch: [1, 4, 3, 2]

Shuffled Indexes for Batch No. 2 :[3, 0, 1, 2]
Values for Batch No.2 :[4, 1, 2, 3]

Shuffled Indexes for Batch No. 3 :[2, 0, 1, 3]
Values for Batch No.3 :[3, 1, 2, 4]

Final value of b: [1, 4, 3, 2, 4, 1, 2, 3, 3, 1] with 10 items

Data Generator Function

This will be a data generator function that takes in batch_size, x, y, and shuffle, where x could be a large list of samples and y is the list of tags associated with those samples. It returns a subset of those inputs in a tuple of two lists (X, Y), each of dimension (batch_size). If shuffle=True, the data will be traversed in random order.

It runs continuously in the fashion of generators, pausing each time it yields the next values. On each pass of the loop it generates a batch_size output.

It has an inner loop that stores the data samples in temporary lists (X, Y) which will be included in the next batch.

There are three slightly out-of-the-ordinary features to this function.

  1. The first is the use of a list of a predefined size to store the data for each batch. Using a predefined size list reduces the computation time if the elements in the array are of a fixed size, like numbers. If the elements are of different sizes, it is better to use an empty array and append one element at a time during the loop.
  2. The second is tracking the current location in the incoming lists of samples. Generator variables hold their values between invocations, so we create an index variable, initialize it to zero, and increment it by one for each sample included in a batch. However, we do not use the index to access the list of samples directly. Instead, we use it to select one entry from a list of indexes. In this way, we can change the order in which we traverse the original list while keeping the original list untouched.
  3. The third relates to wrapping. Because batch_size and the length of the input lists are not aligned, gathering a batch_size group of inputs may involve wrapping back to the beginning of the input list. In our approach, it is enough to reset the index to 0; we can then re-shuffle the list of indexes to produce different batches each time.
def data_generator(batch_size: int, data_x: list, data_y: list, shuffle: bool=True):
    """Infinite batch generator

      Args: 
       batch_size: the size to make batches
       data_x: list containing samples
       data_y: list containing labels
       shuffle: Shuffle the data order

      Yields:
       a tuple containing 2 elements:
       X - list of dim (batch_size) of samples
       Y - list of dim (batch_size) of labels
    """
    amount_of_data = len(data_x)
    assert amount_of_data == len(data_y)

    def re_shuffle(x):
        k = len(x)
        return random.sample(range(k), k=k)

    shuffler = re_shuffle if shuffle else lambda x: list(range(len(x)))
    source_indices = shuffler(data_x)

    source_location = 0
    while True:
        X = list(range(batch_size))
        Y = list(range(batch_size))

        for batch_location in range(batch_size):                            
            X[batch_location] = data_x[source_indices[source_location]]
            Y[batch_location] = data_y[source_indices[source_location]]
            source_location = (source_location + 1) % amount_of_data
            source_indices = (shuffler(data_x) if source_location == 0
                              else source_indices)            
        yield((X, Y))
    return
def test_data_generator() -> None:
    """Tests the un-shuffled version of the generator

    Raises:
     AssertionError: some value didn't match.
    """
    x = [1, 2, 3, 4]
    y = [xi ** 2 for xi in x]

    generator = data_generator(3, x, y, shuffle=False)
    for expected in (([1, 2, 3], [1, 4, 9]),
                     ([4, 1, 2], [16, 1, 4]),
                     ([3, 4, 1], [9, 16, 1]),
                     ([2, 3, 4], [4, 9, 16])):
        expect(numpy.allclose(next(generator), expected)).to(be_true)
    return
test_data_generator()
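
As a quick follow-up (not part of the original tests), here is a hedged check of the shuffled version. It can't assert exact values since the order is random, but the batch sizes and the x-to-y pairing should still hold:

def test_shuffled_data_generator() -> None:
    """Checks batch sizes and pairing for the shuffled generator

    Raises:
     AssertionError: a batch had the wrong size or a mismatched label
    """
    x = [1, 2, 3, 4]
    y = [value ** 2 for value in x]

    generator = data_generator(3, x, y, shuffle=True)
    for _ in range(4):
        X, Y = next(generator)
        expect(len(X) == 3 and len(Y) == 3).to(be_true)
        # every label should still be the square of its sample
        expect(all(label == value ** 2 for value, label in zip(X, Y))).to(be_true)
    return
test_shuffled_data_generator()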

Introducing Trax

Background

This is going to be a first look at Trax a Deep Learning framework built by the Google Brain team.

Why Trax and not TensorFlow or PyTorch?

TensorFlow and PyTorch are both extensive frameworks that can do almost anything in deep learning. They offer a lot of flexibility, but that often means verbosity of syntax and extra time to code.

Trax is much more concise. It runs on a TensorFlow backend but allows you to train models with one-line commands. Trax also runs end to end, allowing you to get data, build a model, and train it, all with a single terse statement. This means you can focus on learning, instead of spending hours on the idiosyncrasies of a big framework's implementation.

Why not Keras then?

Keras has been part of TensorFlow itself since version 2.0. Trax, on the other hand, is well suited to implementing new state-of-the-art algorithms like Transformers, Reformers, and BERT, because it is actively maintained by the Google Brain team for advanced deep learning tasks. It also runs smoothly on CPUs, GPUs, and TPUs with comparatively few changes to the code.

How to Code in Trax

Building models in Trax relies on two key concepts: layers and combinators. Trax layers are simple objects that process data and perform computations. They can be chained together into composite layers using Trax combinators, allowing you to build layers and models of any complexity.

Trax, JAX, TensorFlow and Tensor2Tensor

You already know that Trax uses TensorFlow as a backend, but it also uses the JAX library to speed up computation. You can view JAX as an enhanced and optimized version of numpy.

You import their version of numpy as trax.fastmath.numpy. If you see this line, remember that when calling numpy you are really calling Trax's version of numpy, which is compatible with JAX.

As a result of this, where you used to encounter the type numpy.ndarray you will now find the type jax.interpreters.xla.DeviceArray. The JAX documentation has a page listing the numpy functions implemented so far.

Tensor2Tensor is another name you might have heard. It started as an end to end solution much like how Trax is designed, but it grew unwieldy and complicated. So you can view Trax as the new improved version that operates much faster and simpler.

Installing Trax

Note that there is another library called TraX which is something different.

We're going to use Trax version 1.3.1 here, so to install it with pip:

pip install trax==1.3.1

Note the == for the version, not =. This is a very big install so maybe take a break after you run it. You won't get the full benefit of JAX unless you have CUDA set up or can use TPUs, so make sure to set up CUDA if you're not using Google Colab. I also had to install cmake to get trax to install.

Imports

# pypi
import numpy

from trax import layers
from trax import shapes
from trax import fastmath
  • Layers are the basic building blocks for Trax
  • shapes are used for data handling
  • fastmath is the JAX version of numpy that can run on GPUs and TPUs

Middle

Layers

Layers are the core building blocks in Trax - they are the base classes. They take inputs, compute functions/custom calculations and return outputs.

Relu Layer

First we'll build a ReLU activation function as a layer. A layer like this is one of the simplest types. Notice that there are no weights to initialize, so it works just like a math function.

Note: Activation functions are also layers in Trax, which might look odd if you have been using other frameworks for a longer time.

relu = layers.Relu()

You can inspect the properties of a layer:

print("-- Properties --")
print("name :", relu.name)
print("expected inputs :", relu.n_in)
print("promised outputs :", relu.n_out, "\n")
-- Properties --
name : Relu
expected inputs : 1
promised outputs : 1 

We'll make an input for the layer using numpy.

x = numpy.array([-2, -1, 0, 1, 2])
print("-- Inputs --")
print("x :", x, "\n")
-- Inputs --
x : [-2 -1  0  1  2] 

And see what it puts out.

y = relu(x)
print("-- Outputs --")
print("y :", y)
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
-- Outputs --
y : [0 0 0 1 2]

I don't know why, but JAX doesn't think I have a GPU, even though tensorflow does. This whole thing is a little messed up right now because the current release of tensorflow doesn't work on Ubuntu 20.10. I'm running the nightly build (2.5), but I have to install all the Trax dependencies one at a time or it will clobber the tensorflow installation with the older version (the one that doesn't work), so there are a lot of places for error.

Concatenate Layer

Now a layer that takes 2 inputs. Notice the change in the expected inputs property from 1 to 2.

First create a concatenate trax layer and check out its properties.

concatenate = layers.Concatenate()
print("-- Properties --")
print("name :", concatenate.name)
print("expected inputs :", concatenate.n_in)
print("promised outputs :", concatenate.n_out, "\n")
-- Properties --
name : Concatenate
expected inputs : 2
promised outputs : 1 

Now create the two inputs.

x1 = numpy.array([-10, -20, -30])
x2 = x1 / -10
print("-- Inputs --")
print("x1 :", x1)
print("x2 :", x2, "\n")
-- Inputs --
x1 : [-10 -20 -30]
x2 : [1. 2. 3.] 

And now feed the inputs through the concatenate layer.

y = concatenate([x1, x2])
print("-- Outputs --")
print("y :", y)
-- Outputs --
y : [-10. -20. -30.   1.   2.   3.]

Configuring Layers

You can change the default settings of layers. For example, you can change the expected inputs for a concatenate layer from 2 to 3 using the optional parameter n_items.

concatenate_three = layers.Concatenate(n_items=3)
print("-- Properties --")
print("name :", concatenate_three.name)
print("expected inputs :", concatenate_three.n_in)
print("promised outputs :", concatenate_three.n_out, "\n")
-- Properties --
name : Concatenate
expected inputs : 3
promised outputs : 1 

Create some inputs.

x1 = numpy.array([-10, -20, -30])
x2 = x1 / -10
x3 = x2 * 0.99
print("-- Inputs --")
print("x1 :", x1)
print("x2 :", x2)
print("x3 :", x3, "\n")
-- Inputs --
x1 : [-10 -20 -30]
x2 : [1. 2. 3.]
x3 : [0.99 1.98 2.97] 

And now do the concatenation.

y = concatenate_three([x1, x2, x3])
print("-- Outputs --")
print("y :", y)
-- Outputs --
y : [-10.   -20.   -30.     1.     2.     3.     0.99   1.98   2.97]

Layer Weights

Some layer types include mutable weights and biases that are used in computation and training. Layers of this type require initialization before use.

For example the LayerNorm layer calculates normalized data, that is also scaled by weights and biases. During initialization you pass the data shape and data type of the inputs, so the layer can initialize compatible arrays of weights and biases.

Initialize it.

norm = layers.LayerNorm()

Now some input data.

x = numpy.array([0, 1, 2, 3], dtype="float")

Use the input data signature to get the shape and type for initializing the weights and biases. We need to convert the input datatype from the usual ndarray to a trax ShapeDtype.

norm.init(shapes.signature(x)) 
print("Normal shape:",x.shape, "Data Type:",type(x.shape))
print("Shapes Trax:",shapes.signature(x),"Data Type:",type(shapes.signature(x)))
Normal shape: (4,) Data Type: <class 'tuple'>
Shapes Trax: ShapeDtype{shape:(4,), dtype:float64} Data Type: <class 'trax.shapes.ShapeDtype'>

Here are its properties.

print("-- Properties --")
print("name :", norm.name)
print("expected inputs :", norm.n_in)
print("promised outputs :", norm.n_out)
-- Properties --
name : LayerNorm
expected inputs : 1
promised outputs : 1

And the weights and biases.

print("weights :", norm.weights[0])
print("biases :", norm.weights[1],)
weights : [1. 1. 1. 1.]
biases : [0. 0. 0. 0.]

We have our input array.

print("-- Inputs --")
print("x :", x)
-- Inputs --
x : [0. 1. 2. 3.]

So we can inspect what the layer did to it.

y = norm(x)
print("-- Outputs --")
print("y :", y)
-- Outputs --
y : [-1.3416404  -0.44721344  0.44721344  1.3416404 ]

If you look at the output you can see that the positives cancel out the negatives, giving a sum of 0. That's because layer normalization subtracts the mean and divides by the standard deviation, so (with the default weights of 1 and biases of 0) the output has zero mean and unit variance.
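
To see where those numbers come from, here's a quick by-hand check (a sketch that assumes the default weights and biases leave the normalized values unchanged, and ignores any small epsilon term LayerNorm may add for numerical stability):

# normalize x by hand: subtract the mean, divide by the standard deviation
manual = (x - x.mean())/x.std()
print("by hand :", manual)
by hand : [-1.34164079 -0.4472136   0.4472136   1.34164079]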

Custom Layers

You can create your own custom layers too and define custom functions for computations by using layers.Fn. Let me show you how.

help(layers.Fn)
Help on function Fn in module trax.layers.base:

Fn(name, f, n_out=1)
    Returns a layer with no weights that applies the function `f`.
    
    `f` can take and return any number of arguments, and takes only positional
    arguments -- no default or keyword arguments. It often uses JAX-numpy (`jnp`).
    The following, for example, would create a layer that takes two inputs and
    returns two outputs -- element-wise sums and maxima:
    
        `Fn('SumAndMax', lambda x0, x1: (x0 + x1, jnp.maximum(x0, x1)), n_out=2)`
    
    The layer's number of inputs (`n_in`) is automatically set to number of
    positional arguments in `f`, but you must explicitly set the number of
    outputs (`n_out`) whenever it's not the default value 1.
    
    Args:
      name: Class-like name for the resulting layer; for use in debugging.
      f: Pure function from input tensors to output tensors, where each input
          tensor is a separate positional arg, e.g., `f(x0, x1) --> x0 + x1`.
          Output tensors must be packaged as specified in the `Layer` class
          docstring.
      n_out: Number of outputs promised by the layer; default value 1.
    
    Returns:
      Layer executing the function `f`.
  • Define a custom layer

    In this example we'll create a layer to calculate the input times 2.

    def double_it() -> layers.Fn:
        """A custom layer function that doubles any inputs
    
    
        Returns:
         a custom function that takes one numeric argument and doubles it
        """
        layer_name = "TimesTwo"
    
        # Custom function for the custom layer
        def func(x):
            return x * 2
    
        return layers.Fn(layer_name, func)
    
  • Test it
    double = double_it()
    
    print("-- Properties --")
    print("name :", double.name)
    print("expected inputs :", double.n_in)
    print("promised outputs :", double.n_out)
    
    -- Properties --
    name : TimesTwo
    expected inputs : 1
    promised outputs : 1
    
    x = numpy.array([1, 2, 3])
    print("-- Inputs --")
    print("x :", x, "\n")
    y = double(x)
    print("-- Outputs --")
    print("y :", y)
    
    -- Inputs --
    x : [1 2 3] 
    
    -- Outputs --
    y : [2 4 6]
    

Combinators

You can combine layers to build more complex layers. Trax provides a set of objects named combinator layers to make this happen. Combinators are themselves layers, so they can be composed further.

Serial Combinator

This is the most common and easiest to use. You could, for example, build a simple neural network by combining layers into a single layer using the Serial combinator. This new layer then acts just like a single layer, so you can inspect inputs, outputs and weights, or even combine it into another layer. Combinators can then be used as trainable models. Try adding more layers.

Note: As you might have guessed, if there is a Serial combinator there is a Parallel combinator as well (a short sketch of it follows the Serial example below). Do explore the combinators and other layers in the trax documentation and look at the repo to understand how these layers are written.

serial = layers.Serial(
    layers.LayerNorm(),
    layers.Relu(),
    double,
    layers.Dense(n_units=2),
    layers.Dense(n_units=1),
    layers.LogSoftmax() 
)
  • Initialization
    x = numpy.array([-2, -1, 0, 1, 2]) #input
    serial.init(shapes.signature(x))
    
    print("-- Serial Model --")
    print(serial,"\n")
    print("-- Properties --")
    print("name :", serial.name)
    print("sublayers :", serial.sublayers)
    print("expected inputs :", serial.n_in)
    print("promised outputs :", serial.n_out)
    print("weights & biases:", serial.weights, "\n")
    
    -- Serial Model --
    Serial[
      LayerNorm
      Relu
      TimesTwo
      Dense_2
      Dense_1
      LogSoftmax
    ] 
    
    -- Properties --
    name : Serial
    sublayers : [LayerNorm, Relu, TimesTwo, Dense_2, Dense_1, LogSoftmax]
    expected inputs : 1
    promised outputs : 1
    weights & biases: [(DeviceArray([1, 1, 1, 1, 1], dtype=int32), DeviceArray([0, 0, 0, 0, 0], dtype=int32)), (), (), (DeviceArray([[ 0.19178385,  0.1832077 ],
                 [-0.36949775, -0.03924937],
                 [ 0.43800744,  0.788491  ],
                 [ 0.43107533, -0.3623491 ],
                 [ 0.6186575 ,  0.04764405]], dtype=float32), DeviceArray([-3.0051979e-06,  1.4359505e-06], dtype=float32)), (DeviceArray([[-0.6747592],
                 [-0.8550365]], dtype=float32), DeviceArray([-8.9325863e-07], dtype=float32)), ()] 
    
    print("-- Inputs --")
    print("x :", x, "\n")
    
    y = serial(x)
    print("-- Outputs --")
    print("y :", y)
    
    -- Inputs --
    x : [-2 -1  0  1  2] 
    
    -- Outputs --
    y : [0.]
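
As a small aside, here is a minimal, untested sketch of the Parallel combinator mentioned in the note above. It applies each sub-layer to its own input, side by side; the values in the comments are what I'd expect from the properties, not verified output:

parallel = layers.Parallel(layers.Relu(), layers.Relu())
print("name :", parallel.name)              # Parallel
print("expected inputs :", parallel.n_in)   # presumably 2, one per sub-layer
print("promised outputs :", parallel.n_out) # presumably 2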
    

JAX

Just remember to look out for which numpy you are using: the regular numpy or Trax's JAX-compatible numpy. Watch those import blocks. numpy and fastmath.numpy have different data types.

Regular numpy.

x_numpy = numpy.array([1, 2, 3])
print("good old numpy : ", type(x_numpy), "\n")
good old numpy :  <class 'numpy.ndarray'> 

Fastmath and jax numpy.

x_jax = fastmath.numpy.array([1, 2, 3])
print("jax trax numpy : ", type(x_jax))
jax trax numpy :  <class 'jax.interpreters.xla._DeviceArray'>
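
If you need to hand a JAX array to code that expects plain numpy, you can convert it back (a small sketch; numpy.asarray is ordinary numpy and the DeviceArray supports the conversion):

x_round_trip = numpy.asarray(x_jax)
print("converted back : ", type(x_round_trip))
converted back :  <class 'numpy.ndarray'>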

End

  • Trax is a concise framework, built on TensorFlow, for end to end machine learning. The key building blocks are layers and combinators.
  • This was a lab that was part of coursera's Natural Language Processing with Sequence Models course put up by DeepLearning.AI.

Word Embeddings: Visualizing the Embeddings

Extracting and Visualizing the Embeddings

In the previous post we built a Continuous Bag of Words model that predicts a word from the words surrounding it within a window, using the fraction of the context that each word makes up (e.g. each word's share of the four surrounding words). Now we're going to use the weights of the model as word embeddings and see if we can visualize them.

Imports

# python
from argparse import Namespace
from functools import partial

# pypi
from sklearn.decomposition import PCA

import holoviews
import hvplot.pandas
import pandas

# this project
from neurotic.nlp.word_embeddings import (
    Batches,
    CBOW,
    DataCleaner,
    MetaData,
    TheTrainer,
    )
# my other stuff
from graeae import EmbedHoloviews, Timer

Set Up

cleaner = DataCleaner()
meta = MetaData(cleaner.processed)
TIMER = Timer(speak=False)
SLUG = "word-embeddings-visualizing-the-embeddings"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{SLUG}")
Plot = Namespace(
    width=990,
    height=780,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )
hidden_layer = 50
half_window = 2
batch_size = 128
repetitions = 250
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layer, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

trainer = TheTrainer(model, batches, emit_point=50, verbose=True)
with TIMER:
    trainer()
2020-12-16 16:32:17,189 graeae.timers.timer start: Started: 2020-12-16 16:32:17.189213
50: loss=9.88889093658385
new learning rate: 0.0198
100: loss=9.138356897918037
150: loss=9.149555378031549
new learning rate: 0.013068000000000001
200: loss=9.077599951734605
2020-12-16 16:32:37,403 graeae.timers.timer end: Ended: 2020-12-16 16:32:37.403860
2020-12-16 16:32:37,405 graeae.timers.timer end: Elapsed: 0:00:20.214647
250: loss=8.607763835003631
print(trainer.best_loss)
8.186490214727549

Middle

Set It Up

We're going to use the method of averaging the weights of the two layers to form the embeddings.

embeddings = (trainer.best_weights.input_weights.T
              + trainer.best_weights.hidden_weights)/2

And now our words.

words = ["king", "queen","lord","man", "woman","dog","wolf",
         "rich","happy","sad"]

Now we need to translate the words into their indices so we can grab the rows in the embedding that match.

indices = [meta.word_to_index[word] for word in words]
X = embeddings[indices, :]
print(X.shape, indices) 
(10, 50) [2745, 3951, 2961, 3023, 5675, 1452, 5674, 4191, 2316, 4278]

There are 10 rows to match our ten words and 50 columns to match the number chosen for the hidden layer.

Visualizing

We're going to use sklearn's PCA for Principal Component Analysis. The n_components argument is the number of components it will keep - we'll keep 2.

pca = PCA(n_components=2)
reduced = pca.fit(X).transform(X)
pca_data = pandas.DataFrame(
    reduced,
    columns=["X", "Y"])

pca_data["Word"] = words
points = pca_data.hvplot.scatter(x="X",
                                 y="Y", color=Plot.red)
labels = pca_data.hvplot.labels(x="X", y="Y", text="Word", text_baseline="top")
plot = (points * labels).opts(
    title="PCA Embeddings",
    height=Plot.height,
    width=Plot.width,
    fontscale=Plot.fontscale,
)
outcome = Embed(plot=plot, file_name="embeddings_pca")()
print(outcome)

[Figure: "PCA Embeddings" scatter plot of the ten words projected onto the first two principal components]

Well, that's pretty horrible. Might need work.

End

This is the final post in the series looking at using a Continuous Bag of Words model to create word embeddings.

Word Embeddings: Training the Model

Building and Training the Model

In the previous post we did some preliminary set up and data pre-processing. Now we're going to build and train a Continuous Bag of Words (CBOW) model.

Imports

# python
from argparse import Namespace
from collections import Counter
from enum import Enum, unique
from functools import partial

import math
import random

# pypi
from expects import be_true, contain_exactly, equal, expect

import holoviews
import hvplot.pandas
import numpy
import pandas

# this project
from neurotic.nlp.word_embeddings import DataCleaner, MetaData

# my other stuff
from graeae import EmbedHoloviews, Timer

Set Up

Code from the previous post.

cleaner = DataCleaner()
data = cleaner.processed
meta = MetaData(data)
TIMER = Timer(speak=False)
Embed = partial(EmbedHoloviews, folder_path="files/posts/nlp/word-embeddings-training-the-model")
Plot = Namespace(
    width=990,
    height=780,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

Something to help remember what the numpy axis argument is.

@unique
class Axis(Enum):
    ROWS = 0
    COLUMNS = 1

Middle

Initializing the model

You will now initialize two matrices and two vectors.

  • The first matrix (\(W_1\)) is of dimension \(N \times V\), where V is the number of words in your vocabulary and N is the dimension of your word vector.
  • The second matrix (\(W_2\)) is of dimension \(V \times N\).
  • Vector \(b_1\) has dimensions \(N\times 1\)
  • Vector \(b_2\) has dimensions \(V\times 1\).
  • \(b_1\) and \(b_2\) are the bias vectors of the linear layers from matrices \(W_1\) and \(W_2\).

At this stage we are just initializing the parameters.

Please use numpy.random.rand to generate matrices that are initialized with random values from a uniform distribution, ranging between 0 and 1.

# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: initialize_model
def initialize_model(N: int,V: int, random_seed: int=1) -> tuple:
    """Initialize the matrices with random values

    Args: 
       N:  dimension of hidden vector 
       V:  dimension of vocabulary
       random_seed: random seed for consistent results in the unit tests
     Returns: 
       W1, W2, b1, b2: initialized weights and biases
    """

    numpy.random.seed(random_seed)

    ### START CODE HERE (Replace instances of 'None' with your code) ###
    # W1 has shape (N,V)
    W1 = numpy.random.rand(N, V)
    # W2 has shape (V,N)
    W2 = numpy.random.rand(V, N)
    # b1 has shape (N,1)
    b1 = numpy.random.rand(N, 1)
    # b2 has shape (V,1)
    b2 = numpy.random.rand(V, 1)
    ### END CODE HERE ###

    return W1, W2, b1, b2

Test your function with an example.

tmp_N = 4
tmp_V = 10
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
expect(tmp_W1.shape).to(equal((tmp_N,tmp_V)))
expect(tmp_W2.shape).to(equal((tmp_V,tmp_N)))
expect(tmp_b1.shape).to(equal((tmp_N, 1)))
expect(tmp_b2.shape).to(equal((tmp_V, 1)))
print(f"tmp_W1.shape: {tmp_W1.shape}")
print(f"tmp_W2.shape: {tmp_W2.shape}")
print(f"tmp_b1.shape: {tmp_b1.shape}")
print(f"tmp_b2.shape: {tmp_b2.shape}")
tmp_W1.shape: (4, 10)
tmp_W2.shape: (10, 4)
tmp_b1.shape: (4, 1)
tmp_b2.shape: (10, 1)

Softmax

Before we can start training the model, we need to implement the softmax function as defined in equation 5:

\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=0}^{V-1} e^{z_j}} \tag{5} \]

  • Array indexing in code starts at 0.
  • V is the number of words in the vocabulary (which is also the number of rows of z).
  • i goes from 0 to |V| - 1.

The Implementation

  • Assume that the input z to softmax is a 2D array
  • Each training example is represented by a column of shape (V, 1) in this 2D array.
  • There may be more than one column, in the 2D array, because you can put in a batch of examples to increase efficiency. Let's call the batch size lowercase m, so the z array has shape (V, m)
  • When taking the sum over \(j=0 \cdots V-1\), take the sum for each column (each example) separately.

Please use numpy.exp and numpy.sum (with the appropriate axis argument) for the implementation.

# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: softmax
def softmax(z: numpy.ndarray) -> numpy.ndarray:
    """Calculate the softmax

    Args: 
       z: output scores from the hidden layer
    Returns: 
       yhat: prediction (estimate of y)
    """

    ### START CODE HERE (Replace instances of 'None' with your own code) ###

    # Calculate yhat (softmax)
    yhat = numpy.exp(z)/numpy.sum(numpy.exp(z), axis=Axis.ROWS.value)

    ### END CODE HERE ###

    return yhat
# Test the function
tmp = numpy.array([[1,2,3],
                   [1,1,1]
                   ])
tmp_sm = softmax(tmp)
print(tmp_sm)
expected =  numpy.array([[0.5, 0.73105858, 0.88079708],
                         [0.5, 0.26894142, 0.11920292]])


expect(numpy.allclose(tmp_sm, expected)).to(be_true)
[[0.5        0.73105858 0.88079708]
 [0.5        0.26894142 0.11920292]]

Forward propagation

We're going to implement forward propagation to compute z according to equations (1) to (3).

\begin{align} h &= W_1 \ X + b_1 \tag{1} \\ a &= ReLU(h) \tag{2} \\ z &= W_2 \ a + b_2 \tag{3} \\ \end{align}

For that, you will use as activation the Rectified Linear Unit (ReLU) given by:

\[ f(h)=\max (0,h) \tag{6} \]

Hints: use numpy.dot for the matrix products and numpy.maximum to apply the ReLU.

# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: forward_prop
def forward_prop(x: numpy.ndarray,
                 W1: numpy.ndarray, W2: numpy.ndarray,
                 b1: numpy.ndarray, b2: numpy.ndarray) -> tuple:
    """Pass the data through the network

    Args: 
       x:  average one hot vector for the context 
       W1, W2, b1, b2:  matrices and biases to be learned
    Returns: 
       z:  output score vector
    """

    ### START CODE HERE (Replace instances of 'None' with your own code) ###

    # Calculate h
    h = numpy.dot(W1, x) + b1

    # Apply the relu on h (store result in h)
    h = numpy.maximum(h, 0)

    # Calculate z
    z = numpy.dot(W2, h) + b2

    ### END CODE HERE ###

    return z, h

Test the function

tmp_N = 2
tmp_V = 3
tmp_x = numpy.array([[0,1,0]]).T

tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(N=tmp_N,V=tmp_V, random_seed=1)

print(f"x has shape {tmp_x.shape}")
print(f"N is {tmp_N} and vocabulary size V is {tmp_V}")

tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)

print("call forward_prop")
print()

print(f"z has shape {tmp_z.shape}")
print("z has values:")
print(tmp_z)

print()

print(f"h has shape {tmp_h.shape}")
print("h has values:")
print(tmp_h)

expect(tmp_x.shape).to(equal((3, 1)))
expect(tmp_z.shape).to(equal((3, 1)))
expected = numpy.array(
    [[0.55379268],
     [1.58960774],
     [1.50722933]]
)
expect(numpy.allclose(tmp_z, expected)).to(be_true)
expect(tmp_h.shape).to(equal((2, 1)))
expected = numpy.array(
    [[0.92477674],
     [1.02487333]]
)

expect(numpy.allclose(tmp_h, expected)).to(be_true)
x has shape (3, 1)
N is 2 and vocabulary size V is 3
call forward_prop

z has shape (3, 1)
z has values:
[[0.55379268]
 [1.58960774]
 [1.50722933]]

h has shape (2, 1)
h has values:
[[0.92477674]
 [1.02487333]]

Pack Index with Frequency

def index_with_frequency(context_words: list,
                              word_to_index: dict) -> list:
    """combines indexes and frequency counts-dict

    Args:
     context_words: words to get the indices for
     word_to_index: mapping of word to index

    Returns:
     list of (word-index, word-count) tuples built from context_words
    """
    frequency_dict = Counter(context_words)
    indices = [word_to_index[word] for word in context_words]
    packed = []
    for index in range(len(indices)):
        word_index = indices[index]
        frequency = frequency_dict[context_words[index]]
        packed.append((word_index, frequency))
    return packed
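
A quick sanity check with made-up context words and a hypothetical word_to_index mapping (note that a repeated word gets its full count attached to each of its occurrences):

print(index_with_frequency(["cats", "and", "cats", "dogs"],
                           {"and": 0, "cats": 1, "dogs": 2}))
[(1, 2), (0, 1), (1, 2), (2, 1)]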

Vector Generator

def vectors(data: numpy.ndarray, word_to_index: dict, half_window: int):
    """Generates vectors of fraction of context words each word represents

    Args:
     data: source of the vectors
     word_to_index: mapping of word to index in the vocabulary
     half_window: number of tokens on either side of the word to keep

    Yields:
     tuple of x, y 
    """
    location = half_window
    vocabulary_size = len(word_to_index)
    while True:
        y = numpy.zeros(vocabulary_size)
        x = numpy.zeros(vocabulary_size)
        center_word = data[location]
        y[word_to_index[center_word]] = 1
        context_words = (data[(location - half_window): location]
                         + data[(location + 1) : (location + half_window + 1)])

        for word_index, frequency in index_with_frequency(context_words, word_to_index):
            x[word_index] = frequency/len(context_words)
        yield x, y
        location += 1
        if location >= len(data):
            print("location in data is being set to 0")
            location = 0
    return

Batch Generator

This uses a not-so-common form of the while loop: whenever a loop runs to completion (the condition becomes false and you didn't break out of it), the else clause runs.
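
Here is a minimal illustration of that while/else behavior (a toy snippet, separate from the batch generator):

count = 0
while count < 3:
    count += 1
else:
    print("the condition became false without a break, count =", count)
the condition became false without a break, count = 3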

def batch_generator(data: numpy.ndarray, word_to_index: dict,
                    half_window: int, batch_size: int, original: bool=True):
    """Generate batches of vectors

    Args:
     data: the training data
     word_to_index: map of word to vocabulary index
     half_window: number of tokens to take from either side of word
     batch_size: Number of vectors to put in each training batch
     original: run the original buggy code

    Yields:
     tuple of X, Y batches
    """
    vocabulary_size = len(word_to_index)
    batch_x = []
    batch_y = []
    for x, y in vectors(data,
                        word_to_index,
                        half_window):
        if original:
            while len(batch_x) < batch_size:
                batch_x.append(x)
                batch_y.append(y)

            else:
                yield numpy.array(batch_x).T, numpy.array(batch_y).T
        else:
            if len(batch_x) < batch_size:
                batch_x.append(x)
                batch_y.append(y)

            else:
                yield numpy.array(batch_x).T, numpy.array(batch_y).T
                batch_x = []
                batch_y = []
    return

So every time batch_x reaches the batch_size it yields the tuple and then creates a new batch before continuing the outer for-loop.
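
To make the shapes concrete, here is a small usage sketch with made-up toy data (the sentence and its word_to_index mapping are hypothetical, and original=False selects the corrected branch):

toy_data = ["i", "like", "cats", "and", "dogs", "because", "cats", "and", "dogs", "are", "nice"]
toy_word_to_index = {word: index for index, word in enumerate(sorted(set(toy_data)))}
toy_batches = batch_generator(toy_data, toy_word_to_index,
                              half_window=2, batch_size=4, original=False)
X, Y = next(toy_batches)
# each column is one example, so the shapes are (vocabulary size, batch size)
print(X.shape, Y.shape)
(8, 4) (8, 4)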

Cost function

The cross-entropy loss function.

def compute_cost(y: numpy.ndarray, y_hat: numpy.ndarray,
                 batch_size: int) -> numpy.ndarray:
    """Calculates the cross-entropy loss

    Args:
     y: array with the actual words labeled
     y_hat: our model's guesses for the words
     batch_size: the number of examples per training run
    """
    log_probabilities = (numpy.multiply(numpy.log(y_hat), y)
                         + numpy.multiply(numpy.log(1 - y_hat), 1 - y))
    cost = -numpy.sum(log_probabilities)/batch_size
    cost = numpy.squeeze(cost)
    return cost

Test the function

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4

tmp_word2Ind, tmp_Ind2word = meta.word_to_index, meta.vocabulary
tmp_V = len(meta.vocabulary)

tmp_x, tmp_y = next(batch_generator(data, tmp_word2Ind, tmp_C, tmp_batch_size))

print(f"tmp_x.shape {tmp_x.shape}")
print(f"tmp_y.shape {tmp_y.shape}")

tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

print(f"tmp_W1.shape {tmp_W1.shape}")
print(f"tmp_W2.shape {tmp_W2.shape}")
print(f"tmp_b1.shape {tmp_b1.shape}")
print(f"tmp_b2.shape {tmp_b2.shape}")

tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)
print(f"tmp_z.shape: {tmp_z.shape}")
print(f"tmp_h.shape: {tmp_h.shape}")

tmp_yhat = softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")

tmp_cost = compute_cost(tmp_y, tmp_yhat, tmp_batch_size)
print("call compute_cost")
print(f"tmp_cost {tmp_cost:.4f}")

expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(math.isclose(tmp_cost, 9.9560, abs_tol=1e-4)).to(be_true)
tmp_x.shape (5778, 4)
tmp_y.shape (5778, 4)
tmp_W1.shape (50, 5778)
tmp_W2.shape (5778, 50)
tmp_b1.shape (50, 1)
tmp_b2.shape (5778, 1)
tmp_z.shape: (5778, 4)
tmp_h.shape: (50, 4)
tmp_yhat.shape: (5778, 4)
call compute_cost
tmp_cost 9.9560

Training the Model - Backpropagation

Now that you have understood how the CBOW model works, you will train it. You created a function for the forward propagation. Now you will implement a function that computes the gradients to backpropagate the errors.

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: back_prop
def back_prop(x: numpy.ndarray,
              yhat: numpy.ndarray,
              y: numpy.ndarray,
              h: numpy.ndarray,
              W1: numpy.ndarray,
              W2: numpy.ndarray,
              b1: numpy.ndarray,
              b2: numpy.ndarray,
              batch_size: int) -> tuple:
    """Calculates the gradients

    Args: 
       x:  average one hot vector for the context 
       yhat: prediction (estimate of y)
       y:  target vector
       h:  hidden vector (see eq. 1)
       W1, W2, b1, b2:  matrices and biases  
       batch_size: batch size 

     Returns: 
       grad_W1, grad_W2, grad_b1, grad_b2:  gradients of matrices and biases   
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Compute l1 as W2^T (Yhat - Y)
    # Re-use it whenever you see W2^T (Yhat - Y) used to compute a gradient
    l1 = numpy.dot(W2.T, yhat - y)
    # Apply relu to l1
    l1 = numpy.maximum(l1, 0)
    # Compute the gradient of W1
    grad_W1 = numpy.dot(l1, x.T)/batch_size
    # Compute the gradient of W2
    grad_W2 = numpy.dot(yhat - y, h.T)/batch_size
    # Compute the gradient of b1
    grad_b1 = numpy.sum(l1, axis=Axis.COLUMNS.value, keepdims=True)/batch_size
    # Compute the gradient of b2
    grad_b2 = numpy.sum(yhat - y, axis=Axis.COLUMNS.value, keepdims=True)/batch_size
    ### END CODE HERE ###

    return grad_W1, grad_W2, grad_b1, grad_b2

Test the function

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
tmp_word2Ind, tmp_Ind2word = meta.word_to_index, meta.vocabulary
tmp_V = len(meta.vocabulary)

# get a batch of data
tmp_x, tmp_y = next(batch_generator(data, tmp_word2Ind, tmp_C, tmp_batch_size))

print("get a batch of data")
print(f"tmp_x.shape {tmp_x.shape}")
print(f"tmp_y.shape {tmp_y.shape}")

print()
print("Initialize weights and biases")
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

print(f"tmp_W1.shape {tmp_W1.shape}")
print(f"tmp_W2.shape {tmp_W2.shape}")
print(f"tmp_b1.shape {tmp_b1.shape}")
print(f"tmp_b2.shape {tmp_b2.shape}")

print()
print("Forwad prop to get z and h")
tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)
print(f"tmp_z.shape: {tmp_z.shape}")
print(f"tmp_h.shape: {tmp_h.shape}")

print()
print("Get yhat by calling softmax")
tmp_yhat = softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")

tmp_m = (2*tmp_C)
tmp_grad_W1, tmp_grad_W2, tmp_grad_b1, tmp_grad_b2 = back_prop(tmp_x, tmp_yhat, tmp_y, tmp_h, tmp_W1, tmp_W2, tmp_b1, tmp_b2, tmp_batch_size)

print()
print("call back_prop")
print(f"tmp_grad_W1.shape {tmp_grad_W1.shape}")
print(f"tmp_grad_W2.shape {tmp_grad_W2.shape}")
print(f"tmp_grad_b1.shape {tmp_grad_b1.shape}")
print(f"tmp_grad_b2.shape {tmp_grad_b2.shape}")


expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(tmp_grad_W1.shape).to(equal((50, 5778)))
expect(tmp_grad_W2.shape).to(equal((5778, 50)))
expect(tmp_grad_b1.shape).to(equal((50, 1)))
expect(tmp_grad_b2.shape).to(equal((5778, 1)))
get a batch of data
tmp_x.shape (5778, 4)
tmp_y.shape (5778, 4)

Initialize weights and biases
tmp_W1.shape (50, 5778)
tmp_W2.shape (5778, 50)
tmp_b1.shape (50, 1)
tmp_b2.shape (5778, 1)

Forward prop to get z and h
tmp_z.shape: (5778, 4)
tmp_h.shape: (50, 4)

Get yhat by calling softmax
tmp_yhat.shape: (5778, 4)

call back_prop
tmp_grad_W1.shape (50, 5778)
tmp_grad_W2.shape (5778, 50)
tmp_grad_b1.shape (50, 1)
tmp_grad_b2.shape (5778, 1)

Gradient Descent

Now that you have implemented a function to compute the gradients, you will implement batch gradient descent over your training set.

Hint: For that, you will use the initialize_model and back_prop functions which you just created (and the compute_cost function). You can also use the batch_generator helper function defined above.

Also: print the cost after each batch is processed (use batch size = 128).

# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: gradient_descent
def gradient_descent(data: numpy.ndarray, word2Ind: dict, N: int, V: int ,
                     num_iters: int, alpha=0.03):    
    """
    This is the gradient_descent function

    Args: 
       data:      text
       word2Ind:  words to Indices
       N:         dimension of hidden vector  
       V:         dimension of vocabulary 
       num_iters: number of iterations  

    Returns: 
       W1, W2, b1, b2:  updated matrices and biases   
    """
    W1, W2, b1, b2 = initialize_model(N,V, random_seed=282)
    batch_size = 128
    iters = 0
    C = 2
    for x, y in batch_generator(data, word2Ind, C, batch_size):
        ### START CODE HERE (Replace instances of 'None' with your own code) ###
        # Get z and h
        z, h = forward_prop(x, W1, W2, b1, b2)
        # Get yhat
        yhat = softmax(z)
        # Get cost
        cost = compute_cost(y, yhat, batch_size)
        if ( (iters+1) % 10 == 0):
            print(f"iters: {iters + 1} cost: {cost:.6f}")
        # Get gradients
        grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(x,
                                                       yhat,
                                                       y,
                                                       h,
                                                       W1,
                                                       W2,
                                                       b1,
                                                       b2,
                                                       batch_size)

        # Update weights and biases
        W1 = W1 - alpha * grad_W1
        W2 = W2 - alpha * grad_W2
        b1 = b1 - alpha * grad_b1
        b2 = b2 - alpha * grad_b2

        ### END CODE HERE ###

        iters += 1 
        if iters == num_iters: 
            break
        if iters % 100 == 0:
            alpha *= 0.66

    return W1, W2, b1, b2

Test Your Function

C = 2
N = 50
V = len(meta.vocabulary)
num_iters = 150
print("Call gradient_descent")
W1, W2, b1, b2 = gradient_descent(data, meta.word_to_index, N, V, num_iters)
Call gradient_descent
iters: 10 cost: 0.789141
iters: 20 cost: 0.105543
iters: 30 cost: 0.056008
iters: 40 cost: 0.038101
iters: 50 cost: 0.028868
iters: 60 cost: 0.023237
iters: 70 cost: 0.019444
iters: 80 cost: 0.016716
iters: 90 cost: 0.014660
iters: 100 cost: 0.013054
iters: 110 cost: 0.012133
iters: 120 cost: 0.011370
iters: 130 cost: 0.010698
iters: 140 cost: 0.010100
iters: 150 cost: 0.009566

End

The next post is one on extracting and visualizing the embeddings using Principal Component Analysis.

Bundling It Up

Imports

# python
from collections import Counter, namedtuple
from enum import Enum, unique

# pypi
import attr
import numpy

Enum Setup

@unique
class Axis(Enum):
    ROWS = 0
    COLUMNS = 1

Named Tuples

Gradients = namedtuple("Gradients", ["input_weights", "hidden_weights", "input_bias", "hidden_bias"])

Weights = namedtuple("Weights", ["input_weights", "hidden_weights", "input_bias", "hidden_bias"])

The CBOW Model

@attr.s(auto_attribs=True)
class CBOW:
    """A continuous bag of words model builder

    Args:
     hidden: number of rows in the hidden layer
     vocabulary_size: number of tokens in the vocabulary
     learning_rate: learning rate for back-propagation updates
     random_seed: int
    """
    hidden: int
    vocabulary_size: int
    learning_rate: float=0.03
    random_seed: int=1    
    _random_generator: numpy.random.PCG64=None

    # layer one
    _input_weights: numpy.ndarray=None
    _input_bias: numpy.ndarray=None

    # hidden layer
    _hidden_weights: numpy.ndarray=None
    _hidden_bias: numpy.ndarray=None
  • The Random Generator
    @property
    def random_generator(self) -> numpy.random.PCG64:
        """The random number generator"""
        if self._random_generator is None:
            self._random_generator = numpy.random.default_rng(self.random_seed)
        return self._random_generator
    
  • First Layer Weights

    These are initialized using numpy's new Generator API. I originally used the standard-normal version by mistake and the model did horribly. Generator.random gives you a uniform distribution, which seems to be what you're supposed to use.

    @property
    def input_weights(self) -> numpy.ndarray:
        """Weights for the first layer"""
        if self._input_weights is None:
            self._input_weights = self.random_generator.random(
                (self.hidden, self.vocabulary_size))
        return self._input_weights
    
  • First Layer Bias
    @property
    def input_bias(self) -> numpy.ndarray:
        """Bias for the input layer"""
        if self._input_bias is None:
            self._input_bias = self.random_generator.random(
                (self.hidden, 1)
            )
        return self._input_bias
    
  • Hidden Layer Weights
    @property
    def hidden_weights(self) -> numpy.ndarray:
        """The weights for the hidden layer"""
        if self._hidden_weights is None:
            self._hidden_weights = self.random_generator.random(
                (self.vocabulary_size, self.hidden)
            )
        return self._hidden_weights
    
  • Hidden Layer Bias
    @property
    def hidden_bias(self) -> numpy.ndarray:
        """Bias for the hidden layer"""
        if self._hidden_bias is None:
            self._hidden_bias = self.random_generator.random(
                (self.vocabulary_size, 1)
            )
        return self._hidden_bias
    
  • Softmax
    def softmax(self, scores: numpy.ndarray) -> numpy.ndarray:
        """Calculate the softmax
    
        Args: 
           scores: output scores from the hidden layer
        Returns: 
           yhat: prediction (estimate of y)"""
        return numpy.exp(scores)/numpy.sum(numpy.exp(scores), axis=Axis.ROWS.value)
    
  • Forward Propagation
    def forward(self, data: numpy.ndarray) -> tuple:
        """makes a model prediction
    
        Args:
         data: x-values to train on
    
        Returns:
         output, first-layer output
        """
        first_layer_output = numpy.maximum(numpy.dot(self.input_weights, data)
                                      + self.input_bias, 0)
        second_layer_output = (numpy.dot(self.hidden_weights, first_layer_output)
                       + self.hidden_bias)
        return second_layer_output, first_layer_output
    
  • Gradients
    def gradients(self, data: numpy.ndarray,
                  predicted: numpy.ndarray,
                  actual: numpy.ndarray,
                  hidden_input: numpy.ndarray) -> Gradients:
        """does the gradient calculation for back-propagation
    
        This is broken out to be able to troubleshoot/compare it
    
       Args:
         data: the input x value
         predicted: what our model predicted the labels for the data should be
         actual: what the actual labels should have been
         hidden_input: the input to the hidden layer
        Returns:
         Gradients for input_weight, hidden_weight, input_bias, hidden_bias
        """
        difference = predicted - actual
        batch_size = difference.shape[1]
        l1 = numpy.maximum(numpy.dot(self.hidden_weights.T, difference), 0)
    
        input_weights_gradient = numpy.dot(l1, data.T)/batch_size
        hidden_weights_gradient = numpy.dot(difference, hidden_input.T)/batch_size
        input_bias_gradient = numpy.sum(l1,
                                        axis=Axis.COLUMNS.value,
                                        keepdims=True)/batch_size
        hidden_bias_gradient = numpy.sum(difference,
                                         axis=Axis.COLUMNS.value,
                                         keepdims=True)/batch_size
        return Gradients(input_weights=input_weights_gradient,
                         hidden_weights=hidden_weights_gradient,
                         input_bias=input_bias_gradient,
                         hidden_bias=hidden_bias_gradient)
    
  • Backward Propagation
    def backward(self, data: numpy.ndarray,
                 predicted: numpy.ndarray,
                 actual: numpy.ndarray,
                 hidden_input: numpy.ndarray) -> None:
        """Does back-propagation to update the weights
    
       Args:
         data: the input x value
         predicted: what our model predicted the labels for the data should be
         actual: what the actual labels should have been
         hidden_input: the input to the hidden layer
        """
        gradients = self.gradients(data=data,
                                   predicted=predicted,
                                   actual=actual,
                                   hidden_input=hidden_input)
        # I don't have setters for the properties so use the private variables
        self._input_weights -= self.learning_rate * gradients.input_weights
        self._hidden_weights -= self.learning_rate * gradients.hidden_weights
        self._input_bias -= self.learning_rate * gradients.input_bias
        self._hidden_bias -= self.learning_rate * gradients.hidden_bias
        return
    
  • Call
    def __call__(self, data: numpy.ndarray) -> numpy.ndarray:
        """makes a prediction on the data
    
        Args:
         data: input data for the prediction
    
        Returns:
         softmax of model output
        """
        output, _ = self.forward(data)
        return self.softmax(output)
    

Batch Generator

@attr.s(auto_attribs=True)
class Batches:
    """Generates batches of data

    Args:
     data: the source of the data to generate (training data)
     word_to_index: dict mapping the word to the vocabulary index
     half_window: number of tokens on either side of word to grab
     batch_size: the number of entries per batch
     batches: number of batches to generate before quitting
     verbose: whether to emit messages
    """
    data: numpy.ndarray
    word_to_index: dict
    half_window: int
    batch_size: int
    batches: int
    repetitions: int=0
    verbose: bool=False    
    _vocabulary_size: int=None
    _vectors: object=None
  • Vocabulary Size
    @property
    def vocabulary_size(self) -> int:
        """Number of tokens in the vocabulary"""
        if self._vocabulary_size is None:
            self._vocabulary_size = len(self.word_to_index)
        return self._vocabulary_size
    
  • Vectors
    @property
    def vectors(self):
        """our vector-generator started up"""
        if self._vectors is None:
            self._vectors = self.vector_generator()
        return self._vectors
    
  • Indices and Frequencies
    def indices_and_frequencies(self, context_words: list) -> list:
        """combines word-indexes and frequency counts-dict
    
        Args:
         context_words: words to get the indices for
    
        Returns:
         list of (word-index, word-count) tuples built from context_words
        """
        frequencies = Counter(context_words)
        indices = [self.word_to_index[word] for word in context_words]
        return [(indices[index], frequencies[context_words[index]])
                for index in range(len(indices))]
    
  • Vectors
    def vector_generator(self):
        """Generates vectors infinitely
    
        x: fraction of context words represented by word
        y: array with 1 where center word is in the vocabulary and 0 elsewhere
    
        Yields:
         tuple of x, y 
        """
        location = self.half_window
        while True:
            y = numpy.zeros(self.vocabulary_size)
            x = numpy.zeros(self.vocabulary_size)
            center_word = self.data[location]
            y[self.word_to_index[center_word]] = 1
            context_words = (
                self.data[(location - self.half_window): location]
                + self.data[(location + 1) : (location + self.half_window + 1)])
    
            for word_index, frequency in self.indices_and_frequencies(context_words):
                x[word_index] = frequency/len(context_words)
            yield x, y
            location += 1
            if location >= len(self.data):
                if self.verbose:
                    print("location in data is being set to 0")
                location = 0
        return
    
  • Iterator Method
    def __iter__(self):
        """makes this into an iterator"""
        return self
    
  • Next Method
    def __next__(self) -> tuple:
        """Creates the batches and returns them
    
        Returns:
         x, y batches
        """
        batch_x = []
        batch_y = []
    
        if self.repetitions == self.batches:
            raise StopIteration()
        self.repetitions += 1    
        for x, y in self.vectors:
            if len(batch_x) < self.batch_size:
                batch_x.append(x)
                batch_y.append(y)
            else:
                return numpy.array(batch_x).T, numpy.array(batch_y).T
        return
    

The Trainer

@attr.s(auto_attribs=True)
class TheTrainer:
    """Something to train the model

    Args:
     model: thing to train
     batches: batch generator
     learning_impairment: rate to slow the model's learning
     impairment_point: how frequently to impair the learner
     emit_point: how frequently to emit messages
     verbose: whether to emit messages
    """
    model: CBOW
    batches: Batches
    learning_impairment: float=0.66
    impairment_point: int=100
    emit_point: int=10
    verbose: bool=False
    _losses: list=None
  • Losses
    @property
    def losses(self) -> list:
        """Holder for the training losses"""
        if self._losses is None:
            self._losses = []
        return self._losses
    
  • Gradient Descent
    def __call__(self):    
        """Trains the model using gradient descent
        """
        self.best_loss = float("inf")
        for repetitions, x_y in enumerate(self.batches):
            x, y = x_y
            output, hidden_input = self.model.forward(x)
            predictions = self.model.softmax(output)
    
            loss = self.cross_entropy_loss(predicted=predictions, actual=y)
            if loss < self.best_loss:
                self.best_loss = loss
                self.best_weights = Weights(
                    self.model.input_weights.copy(),
                    self.model.hidden_weights.copy(),
                    self.model.input_bias.copy(),
                    self.model.hidden_bias.copy(),
                )
            self.losses.append(loss)
            self.model.backward(data=x, predicted=predictions, actual=y,
                                hidden_input=hidden_input)
            if ((repetitions + 1) % self.impairment_point) == 0:
                self.model.learning_rate *= self.learning_impairment
                if self.verbose:
                    print(f"new learning rate: {self.model.learning_rate}")
            if self.verbose and ((repetitions + 1) % self.emit_point == 0):
                print(f"{repetitions + 1}: loss={self.losses[repetitions]}")
        return 
    
  • Cross-Entropy-Loss
    def cross_entropy_loss(self, predicted: numpy.ndarray,
                           actual: numpy.ndarray) -> numpy.ndarray:
        """Calculates the cross-entropy loss
    
        Args:
         predicted: array with the model's guesses
         actual: array with the actual labels
    
        Returns:
         the cross-entropy loss
        """
        log_probabilities = (numpy.multiply(numpy.log(predicted), actual)
                             + numpy.multiply(numpy.log(1 - predicted), 1 - actual))
        cost = -numpy.sum(log_probabilities)/self.batches.batch_size
        return numpy.squeeze(cost)
    

Testing It

from neurotic.nlp.word_embeddings import Batches, CBOW, TheTrainer

N = 4
V = len(meta.vocabulary)
model = CBOW(hidden=N, vocabulary_size=V)


expect(model.vocabulary_size).to(equal(V))
expect(model.input_weights.shape).to(equal((N, V)))
expect(model.hidden_weights.shape).to(equal((V, N)))
expect(model.input_bias.shape).to(equal((N, 1)))
expect(model.hidden_bias.shape).to(equal((V, 1)))

tmp = numpy.array([[1,2,3],
                   [1,1,1]
                   ])
tmp_sm = model.softmax(tmp)
expected =  numpy.array([[0.5, 0.73105858, 0.88079708],
                         [0.5, 0.26894142, 0.11920292]])


expect(numpy.allclose(tmp_sm, expected)).to(be_true)

Forward Propagation

tmp_N = 2
tmp_V = 3
tmp_x = numpy.array([[0,1,0]]).T

model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(N=tmp_N,V=tmp_V, random_seed=1)

model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2

tmp_z, tmp_h = model.forward(tmp_x)

expect(tmp_x.shape).to(equal((3, 1)))
expect(tmp_z.shape).to(equal((3, 1)))
expect(tmp_h.shape).to(equal((2, 1)))

expected = numpy.array(
    [[0.55379268],
     [1.58960774],
     [1.50722933]]
)
expect(numpy.allclose(tmp_z, expected)).to(be_true)

expected = numpy.array(
    [[0.92477674],
     [1.02487333]]
)

expect(numpy.allclose(tmp_h, expected)).to(be_true)

Cross Entropy Loss

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  batches=15,
                  half_window=tmp_C, batch_size=tmp_batch_size)

tmp_V = len(meta.vocabulary)

tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2

tmp_z, tmp_h = model.forward(tmp_x)

tmp_yhat = model.softmax(tmp_z)

train = TheTrainer(model=model, batches=batches, verbose=True)
tmp_cost = train.cross_entropy_loss(actual=tmp_y, predicted=tmp_yhat)

expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(math.isclose(tmp_cost, 9.9560, abs_tol=1e-4)).to(be_true)

Back Propagation

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4

# get a batch of data
tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2
tmp_z, tmp_h = model.forward(tmp_x)
tmp_yhat = model.softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")

gradients = model.gradients(data=tmp_x, predicted=tmp_yhat, actual=tmp_y, hidden_input=tmp_h)
tmp_grad_W1, tmp_grad_W2, tmp_grad_b1, tmp_grad_b2 = back_prop(tmp_x, tmp_yhat, tmp_y, tmp_h, tmp_W1, tmp_W2, tmp_b1, tmp_b2, tmp_batch_size)

expect(numpy.allclose(gradients.input_weights, tmp_grad_W1)).to(be_true)
expect(numpy.allclose(gradients.hidden_weights, tmp_grad_W2)).to(be_true)
expect(numpy.allclose(gradients.input_bias, tmp_grad_b1)).to(be_true)
expect(numpy.allclose(gradients.hidden_bias, tmp_grad_b2)).to(be_true)

expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(tmp_grad_W1.shape).to(equal((50, 5778)))
expect(tmp_grad_W2.shape).to(equal((5778, 50)))
expect(tmp_grad_b1.shape).to(equal((50, 1)))
expect(tmp_grad_b2.shape).to(equal((5778, 1)))

Putting Some Stuff Together

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
hidden_layers = 50

batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  batches=15,
                  half_window=tmp_C, batch_size=tmp_batch_size)
tmp_x, tmp_y = next(batches)
model = CBOW(hidden=hidden_layers, vocabulary_size=len(meta.vocabulary))
prediction = model(tmp_x)

train = TheTrainer(model=model, batches=batches, verbose=True)
print(train.cross_entropy_loss(predicted=prediction, actual=tmp_y))
print(compute_cost(tmp_y, prediction, tmp_batch_size))

# using their initial weights
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
expect(model.input_weights.shape).to(equal(tmp_W1.shape))
expect(model.hidden_weights.shape).to(equal(tmp_W2.shape))
expect(model.input_bias.shape).to(equal(tmp_b1.shape))
expect(model.hidden_bias.shape).to(equal(tmp_b2.shape))

model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2

tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
prediction = model(tmp_x)

train = TheTrainer(model=model, batches=batches, verbose=True)
print(train.cross_entropy_loss(predicted=prediction, actual=tmp_y))
print(compute_cost(tmp_y, prediction, tmp_batch_size))
11.871189103548419
11.871189103548419
9.956016099656951
9.956016099656951

I changed the weights to use the uniform distribution, which seems to work better, but weirdly it still does a little worse initially. The random seed also seems to behave differently between the old numpy random functions and their new generator.
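
Here's a minimal sketch (mine, not from the original post) of what I mean about the seeds: the legacy numpy.random functions and the new Generator API produce different streams even when given the same seed.

import numpy

# legacy global RandomState versus the new Generator: same seed, different streams
numpy.random.seed(1)
legacy = numpy.random.rand(3)

new_style = numpy.random.default_rng(1).random(3)

print(legacy)
print(new_style)
print(numpy.allclose(legacy, new_style))  # expected to be False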

The Batches

The original batch-generator had a couple of bugs in it. To avoid them pass in original=False.

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4

batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  batches=5,
                  half_window=tmp_C, batch_size=tmp_batch_size)


old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C,
                                tmp_batch_size, original=False)


old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)
expect(numpy.allclose(tmp_x, old_x)).to(be_true)
expect(numpy.allclose(tmp_y, old_y)).to(be_true)


old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)
#expect(numpy.allclose(tmp_x, old_x)).to(be_true)
#expect(numpy.allclose(tmp_y, old_y)).to(be_true)

old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)

Gradient Descent

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150

model = CBOW(hidden=hidden_layers, vocabulary_size=len(meta.vocabulary))
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)
train = TheTrainer(model=model, batches=batches, verbose=True)
train()
10: loss=12.949165499168524
20: loss=7.1739091478289225
30: loss=13.431976455238479
40: loss=4.0062314323745545
50: loss=11.595407087927406
60: loss=10.41983077447342
70: loss=7.843047289924249
80: loss=12.529314536141994
90: loss=14.122707806423126
new learning rate: 0.0198
100: loss=10.80530164111974
110: loss=4.624869443165228
120: loss=5.552813055551899
130: loss=8.483428176366933
140: loss=9.047299388851195
150: loss=4.841072955589429

Gradient Re-do

Something's wrong with the trainer's gradient descent, so I'm going to try updating the original function to use my model instead.

def grady_the_ent(model: CBOW, data: numpy.ndarray,
                  num_iters: int, batches: Batches, alpha=0.03):
    """This is the gradient_descent function

    Args:
     model: the CBOW model whose weights get updated in place
     data: the text (unused here, kept to match the original signature)
     num_iters: number of iterations
     batches: generator of (x, y) batches
     alpha: the learning rate

    Returns:
     None - the model's weights and biases are updated in place
    """
    batch_size = 128
    iters = 0
    for x, y in batches:
        z, h = model.forward(x)
        # Get yhat
        yhat = model.softmax(z)
        # Get cost
        cost = compute_cost(y, yhat, batch_size)
        if ((iters+1) % 10 == 0):
            print(f"iters: {iters + 1} cost: {cost:.6f}")
        grad_W1, grad_W2, grad_b1, grad_b2 = model.gradients(x,
                                                             yhat,
                                                             y,
                                                             h)

        # Update weights and biases
        model._input_weights -= alpha * grad_W1
        model._hidden_weights -= alpha * grad_W2
        model._input_bias -=  alpha * grad_b1
        model._hidden_bias -=  alpha * grad_b2

        iters += 1 
        if iters == num_iters: 
            break
        if iters % 100 == 0:
            alpha *= 0.66

    return
hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
# batch_generator(data, word2Ind, C, batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 12.949165
iters: 20 cost: 7.173909
iters: 30 cost: 13.431976
iters: 40 cost: 4.006231
iters: 50 cost: 11.595407
iters: 60 cost: 10.419831
iters: 70 cost: 7.843047
iters: 80 cost: 12.529315
iters: 90 cost: 14.122708
iters: 100 cost: 10.805302
iters: 110 cost: 4.624869
iters: 120 cost: 5.552813
iters: 130 cost: 8.483428
iters: 140 cost: 9.047299
iters: 150 cost: 4.841073

So, something's wrong with the gradient descent.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = batch_generator(data, meta.word_to_index, half_window, batch_size)
#batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
#                  half_window=half_window, batch_size=batch_size, batches=repetitions)

grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 0.407862
iters: 20 cost: 0.090807
iters: 30 cost: 0.050924
iters: 40 cost: 0.035379
iters: 50 cost: 0.027105
iters: 60 cost: 0.021969
iters: 70 cost: 0.018470
iters: 80 cost: 0.015932
iters: 90 cost: 0.014008
iters: 100 cost: 0.012499
iters: 110 cost: 0.011631
iters: 120 cost: 0.010911
iters: 130 cost: 0.010274
iters: 140 cost: 0.009708
iters: 150 cost: 0.009201

It looks like it's the batches.

Troubleshooting the Batches

half_window = 2
batch_size = 128
repetitions = 150

batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

start = random.randint(0, 100)
context = cleaner.processed[start: start + half_window] + cleaner.processed[start + half_window + 1: start + 2 * half_window + 1]
packed_1 = index_with_frequency(context, meta.word_to_index)
packed_2 = batches.indices_and_frequencies(context)
expect(packed_1).to(contain_exactly(*packed_2))

So the indices_and_frequencies method is okay.

half_window = 2

v = vectors(cleaner.processed, meta.word_to_index, half_window)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)
repetition = 0
for old, new in zip(v, batches.vectors):
    expect((old[0] == new[0]).all()).to(equal(True))
    expect((old[1] == new[1]).all()).to(equal(True))
    repetition += 1
    if repetition == repetitions:
        break

And the vectors look okay.

old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)
repetitions = 150
repetition = 0
# batch = next(batches)
for old in old_generator:
    batch_x = []
    batch_y = []
    for x, y in batches.vectors:
        while len(batch_x) < batches.batch_size:
            batch_x.append(x)
            batch_y.append(y)
        else:
            newx, newy = numpy.array(batch_x).T, numpy.array(batch_y).T
            expect((old[0]==newx).all()).to(equal(True))
            repetition += 1
            if repetition == repetitions:
                break
    else:
        continue
    break

So, weirdly, rolling the __next__ by hand seems to work.

old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)

repetition, repetitions = 0, 150
for old, new in zip(old_generator, batches):
    try:
        expect((old[0] == new[0]).all()).to(equal(True))
        expect((old[1] == new[1]).all()).to(equal(True))
    except AssertionError:
        print(repetition)
        break
    repetition += 1
    if repetition == repetitions:
        break
1

But not the batches.

old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)

repetition, repetitions = 0, 150
for old in old_generator:
    new = next(batches)
    expect(old[0].shape).to(equal(new[0].shape))
    try:
        expect((old[0] == new[0]).all()).to(equal(True))
        expect((old[1] == new[1]).all()).to(equal(True))
    except AssertionError:
        print(repetition)
        break
    repetition += 1
    if repetition == repetitions:
        break

Actually, it looks like the old generator might be broken.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = batch_generator(data, meta.word_to_index, half_window, batch_size)
#batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
#                  half_window=half_window, batch_size=batch_size, batches=repetitions)

grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 12.949165
iters: 20 cost: 7.173909
iters: 30 cost: 13.431976
iters: 40 cost: 4.006231
iters: 50 cost: 11.595407
iters: 60 cost: 10.419831
iters: 70 cost: 7.843047
iters: 80 cost: 12.529315
iters: 90 cost: 14.122708
iters: 100 cost: 10.805302
iters: 110 cost: 4.624869
iters: 120 cost: 5.552813
iters: 130 cost: 8.483428
iters: 140 cost: 9.047299
iters: 150 cost: 4.841073

The old generator wasn't creating new lists every time, so it was just fitting the same batch of data every time. In fact, it had a while loop where a conditional should have been, so it built a single batch with the same x and y repeated over and over, which should really give the worst performance, not the really good performance the original generator showed. I didn't re-run the cells above, but this next set is being run after fixing my implementation.
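
As a stripped-down illustration of that while-versus-conditional difference (these buggy_batch and fixed_batch helpers are made up for the sketch, not the actual course generator):

def buggy_batch(pairs, batch_size):
    """while-loop version: repeats the same pair until the batch fills"""
    for x, y in pairs:
        batch_x, batch_y = [], []
        while len(batch_x) < batch_size:
            batch_x.append(x)
            batch_y.append(y)
        yield batch_x, batch_y

def fixed_batch(pairs, batch_size):
    """conditional version: each pair is added once, batches emitted when full"""
    batch_x, batch_y = [], []
    for x, y in pairs:
        batch_x.append(x)
        batch_y.append(y)
        if len(batch_x) == batch_size:
            yield batch_x, batch_y
            batch_x, batch_y = [], []

pairs = [(1, 10), (2, 20), (3, 30), (4, 40)]
print(next(buggy_batch(pairs, 2)))  # ([1, 1], [10, 10]) - the same pair repeated
print(next(fixed_batch(pairs, 2)))  # ([1, 2], [10, 20]) - two different pairs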

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 300
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)
trainer = TheTrainer(model, batches, emit_point=50)
with TIMER:
    trainer()
2020-12-16 14:15:54,530 graeae.timers.timer start: Started: 2020-12-16 14:15:54.530779
2020-12-16 14:16:18,600 graeae.timers.timer end: Ended: 2020-12-16 14:16:18.600880
2020-12-16 14:16:18,602 graeae.timers.timer end: Elapsed: 0:00:24.070101
print(trainer.losses[0], trainer.losses[-1])
11.99601105791401 8.827228045367379

Not a huge improvement, but it didn't run for a long time either.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 1000
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

trainer = TheTrainer(model, batches, emit_point=100, verbose=True)
with TIMER:
    trainer()
2020-12-16 14:40:13,275 graeae.timers.timer start: Started: 2020-12-16 14:40:13.275964
new learning rate: 0.0198
100: loss=9.138356897918037
new learning rate: 0.013068000000000001
200: loss=9.077599951734605
new learning rate: 0.008624880000000001
300: loss=8.827228045367379
new learning rate: 0.005692420800000001
400: loss=8.556788482755191
new learning rate: 0.003756997728000001
500: loss=8.92744766914796
new learning rate: 0.002479618500480001
600: loss=9.052677036205138
new learning rate: 0.0016365482103168007
700: loss=8.914532962726918
new learning rate: 0.0010801218188090885
800: loss=8.885698480310062
new learning rate: 0.0007128804004139984
900: loss=9.042620463323736
2020-12-16 14:41:33,457 graeae.timers.timer end: Ended: 2020-12-16 14:41:33.457065
2020-12-16 14:41:33,458 graeae.timers.timer end: Elapsed: 0:01:20.181101
new learning rate: 0.000470501064273239
1000: loss=9.239992952104755

Hmm… doesn't seem to be improving.

losses = pandas.Series(trainer.losses)
line = holoviews.VLine(losses.idxmin()).opts(color=Plot.blue)
time_series = losses.hvplot().opts(title="Loss per Repetition",
                                   width=Plot.width, height=Plot.height,
                                   color=Plot.tan)

plot = time_series * line
output = Embed(plot=plot, file_name="training_1000")()
print(output)

Figure Missing

Since the losses are in a Series we can use its idxmin method to see when the losses bottomed out.

print(losses.idxmin())
247
print(losses.loc[247], losses.iloc[-1])
8.186490214727549 9.239992952104755

So it did the best at 247 and then got a little worse as we went along.

print(len(meta.word_to_index)/batch_size)
45.140625

So the vocabulary only amounts to about 45 batches' worth of words, which I'm guessing means it starts overfitting after a while.

Word Embeddings: Shakespeare Data

Beginning

This is the first part of a series on building word embeddings using a Continuous Bag of Words. There's an overview post that has links to all the posts in the series.

Imports

# python
import os
import random
import re

# pypi
from expects import equal, expect

Middle

We're going to be using the same dataset that we used in building the autocorrect system.

A Little Cleaning

Imports

# python
from pathlib import Path

import os
import re

# pypi
from dotenv import load_dotenv

import attr
import nltk

The Cleaner

@attr.s(auto_attribs=True)
class DataCleaner:
    """A cleaner for the word-embeddings data

    Args:
     key: environment key with path to the data file
     env_path: path to the .env file
    """
    key: str="SHAKESPEARE"
    env_path: str="posts/nlp/.env"
    stop: str="."
    _data_path: str=None
    _data: str=None
    _unpunctuated: str=None
    _punctuation: re.Pattern=None
    _tokens: list=None
    _processed: list=None
  • The Path To the Data
    @property
    def data_path(self) -> Path:
        """The path to the data file"""
        if self._data_path is None:
            load_dotenv(self.env_path)
            self._data_path = Path(os.environ[self.key]).expanduser()
        return self._data_path
    
  • The Data
    @property
    def data(self) -> str:
        """The data-file read in as a string"""
        if self._data is None:
            with self.data_path.open() as reader:
                self._data = reader.read()
        return self._data
    
  • The Punctuation Expression
    @property
    def punctuation(self) -> re.Pattern:
        """The regular expression to find punctuation"""
        if self._punctuation is None:
            self._punctuation = re.compile("[,!?;-]")
        return self._punctuation
    
  • The Un-Punctuated
    @property
    def unpunctuated(self) -> str:
        """The data with punctuation replaced by stop"""
        if self._unpunctuated is None:
            self._unpunctuated = self.punctuation.sub(self.stop, self.data)
        return self._unpunctuated
    
  • The Tokens

    We're going to use NLTK's word_tokenize function to tokenize the string.

    @property
    def tokens(self) -> list:
        """The tokenized data"""
        if self._tokens is None:
            self._tokens = nltk.word_tokenize(self.unpunctuated)
        return self._tokens
    
  • The Processed Tokens

    The final processed data will be all lowercased words and periods only.

    @property
    def processed(self) -> list:
        """The final processed tokens"""
        if self._processed is None:
            self._processed = [token.lower() for token in self.tokens
                               if token.isalpha() or token==self.stop]
        return self._processed
    

The Counter

@attr.s(auto_attribs=True)
class MetaData:
    """Compile some basic data about the data

    Args:
     data: the cleaned and tokenized data
    """
    data: list
    _distribution: nltk.probability.FreqDist=None
    _vocabulary: tuple=None
    _word_to_index: dict=None
  • The Frequency Distribution

    According to the doc-string, the FreqDist is meant to hold outcomes from experiments. It looks like a Counter with extra methods added (there's a small sketch of this just after the class definition below).

    @property
    def distribution(self) -> nltk.probability.FreqDist:
        """The Token Frequency Distribution"""
        if self._distribution is None:
            self._distribution = nltk.FreqDist(self.data)
        return self._distribution
    
  • The Vocabulary
    @property
    def vocabulary(self) -> tuple:
        """The sorted unique tokens in the data"""
        if self._vocabulary is None:
            self._vocabulary = tuple(sorted(set(self.data)))
        return self._vocabulary
    
  • The Word-To-Index Mapping
    @property
    def word_to_index(self) -> dict:
        """Maps words to their index in the vocabulary"""
        if self._word_to_index is None:
            self._word_to_index = {word: index
                                   for index, word in enumerate(self.vocabulary)}
        return self._word_to_index
    
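Here's a small sketch of the Counter-like behavior mentioned in the Frequency Distribution note above, using toy tokens rather than the Shakespeare data:

import nltk

# a toy FreqDist to show the Counter-like behavior
tokens = ["to", "be", "or", "not", "to", "be"]
distribution = nltk.FreqDist(tokens)
print(distribution["to"])           # dict-style count, just like a Counter
print(distribution.most_common(2))  # the two most frequent tokens with their counts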

The Cleaned

from neurotic.nlp.word_embeddings import DataCleaner
cleaner = DataCleaner()
print(cleaner.unpunctuated[:50])
print(cleaner.tokens[:10])
print(cleaner.processed[:10])
print(f"Tokens: {len(cleaner.processed):,}")
O for a Muse of fire. that would ascend
The bright
['O', 'for', 'a', 'Muse', 'of', 'fire', '.', 'that', 'would', 'ascend']
['o', 'for', 'a', 'muse', 'of', 'fire', '.', 'that', 'would', 'ascend']
Tokens: 60,996

The Meta Data

from neurotic.nlp.word_embeddings import MetaData
counter = MetaData(cleaner.processed)

print(f"Size of vocabulary: {len(counter.distribution):,}")
for token in counter.distribution.most_common(20):
    print(f" - {token}")
words = len(counter.distribution)
expect(len(counter.vocabulary)).to(equal(words))
expect(len(counter.word_to_index)).to(equal(words))
print(f"Size of the Vocabulary: {len(counter.vocabulary):,}")

index = random.randrange(words)
word = counter.vocabulary[index]
expect(index).to(equal(counter.word_to_index[word]))
Size of vocabulary: 5,778
 - ('.', 9630)
 - ('the', 1521)
 - ('and', 1394)
 - ('i', 1257)
 - ('to', 1159)
 - ('of', 1093)
 - ('my', 857)
 - ('that', 781)
 - ('in', 770)
 - ('a', 752)
 - ('you', 748)
 - ('is', 630)
 - ('not', 559)
 - ('for', 467)
 - ('it', 460)
 - ('with', 441)
 - ('his', 434)
 - ('but', 417)
 - ('me', 417)
 - ('your', 397)
Size of the Vocabulary: 5,778

End

Now that we have the data set up, it's time to build and train the model.

Word Embeddings: Build a Model

Introduction

This is an introduction to a series of posts that look at how to create word embeddings using a Continuous Bag Of Words (CBOW) model.

The Continuous Bag Of Words Model (CBOW)

Let's take a look at the following sentence: 'I am happy because I am learning'.

  • In continuous bag of words (CBOW) modeling, we try to predict the center word given a few context words (the words around the center word).
  • For example, if you were to choose a context half-size of say C = 2, then you would try to predict the word happy given the context that includes 2 words before and 2 words after the center word:
    • C words before: [I, am]
    • C words after: [because, I]
  • In other words (a quick sketch of this extraction follows below):
context = [I, am, because, I]
target = happy
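
Here's a quick sketch (mine, not from the original material) of that context/target extraction in plain Python:

sentence = "I am happy because I am learning".split()
C = 2
center_index = 2  # the position of "happy"
context = (sentence[center_index - C: center_index]
           + sentence[center_index + 1: center_index + C + 1])
target = sentence[center_index]
print(context)  # ['I', 'am', 'because', 'I']
print(target)   # happy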

The model will be a three-layer one. The input layer (\(\bar x\)) is the average of all the one-hot vectors of the context words. There will be one hidden layer, and the output layer (\(\hat y\)) will be the softmax layer.

The architecture you will be implementing is as follows:

\begin{align} h &= W_1 \ X + b_1 \tag{1} \\ a &= ReLU(h) \tag{2} \\ z &= W_2 \ a + b_2 \tag{3} \\ \hat y &= softmax(z) \tag{4} \\ \end{align}

The Parts

This is just an introductory post; the following are the posts in the series where things will actually be implemented.

Extracting Word Embeddings

Introduction and Preliminaries

In the previous post we trained the CBOW model, now in this post we'll look at how to extract word embedding vectors from a model.

Imports

# from pypi
from expects import be_true, expect
import numpy

Preliminary Setup

Before moving on, we'll set up some variables needed for the following procedures, which should be familiar by now. We'll also simulate a trained CBOW model by providing its weights and biases:

Define the tokenized version of the corpus.

words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']

Define V. Remember this is the size of the vocabulary.

vocabulary =  sorted(set(words))
V = len(vocabulary)

Get the word_to_index and index_to_word dictionaries for the tokenized corpus.

word_to_index = {word: index for index, word in enumerate(vocabulary)}
index_to_word = dict(enumerate(vocabulary))

Define first matrix of weights

W1 = numpy.array([
    [ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
    [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
    [ 0.26637602, -0.23846886, -0.37770863, -0.11399446,  0.34008124]])

Define second matrix of weights.

W2 = numpy.array([[-0.22182064, -0.43008631,  0.13310965],
                  [ 0.08476603,  0.08123194,  0.1772054 ],
                  [ 0.1871551 , -0.06107263, -0.1790735 ],
                  [ 0.07055222, -0.02015138,  0.36107434],
                  [ 0.33480474, -0.39423389, -0.43959196]])

Define first vector of biases.

b1 = numpy.array([[ 0.09688219],
                  [ 0.29239497],
                  [-0.27364426]])

Define second vector of biases.

b2 = numpy.array([[ 0.0352008 ],
                  [-0.36393384],
                  [-0.12775555],
                  [-0.34802326],
                  [-0.07017815]])

Extracting word embedding vectors

Once you have finished training the neural network, you have three options to get word embedding vectors for the words of your vocabulary, based on the weight matrices \(\mathbf{W_1}\) and/or \(\mathbf{W_2}\).

Option 1: Extract embedding vectors from \(\mathbf{W_1}\)

The first option is to take the columns of \(\mathbf{W_1}\) as the embedding vectors of the words of the vocabulary, using the same order of the words as for the input and output vectors.

Note: in these practice notebooks the values of the word embedding vectors are meaningless, since we only trained for a single iteration with just one training example, but here's how you would proceed after the training process is complete.

For example \(\mathbf{W_1}\) is this matrix:

print(W1)
[[ 0.41687358  0.08854191 -0.23495225  0.28320538  0.41800106]
 [ 0.32735501  0.22795148 -0.23951958  0.4117634  -0.23924344]
 [ 0.26637602 -0.23846886 -0.37770863 -0.11399446  0.34008124]]

The first column, which is a 3-element vector, is the embedding vector of the first word of your vocabulary. The second column is the word embedding vector for the second word, and so on.

These are the words corresponding to the columns.

for word in vocabulary:
    print(f" - {word}")
- am
- because
- happy
- i
- learning

And the word embedding vectors corresponding to each word are:

for word, index in word_to_index.items():
    word_embedding_vector = W1[:, index]
    print(f'{word}:    \t{word_embedding_vector}')
am:     [0.41687358 0.32735501 0.26637602]
because:        [ 0.08854191  0.22795148 -0.23846886]
happy:          [-0.23495225 -0.23951958 -0.37770863]
i:      [ 0.28320538  0.4117634  -0.11399446]
learning:       [ 0.41800106 -0.23924344  0.34008124]

Option 2: Extract embedding vectors from \(\mathbf{W_2}\)

The second option is to transpose \(\mathbf{W_2}\) and take the columns of this transposed matrix as the word embedding vectors just like you did for \(\mathbf{W_1}\).

print(W2.T)
[[-0.22182064  0.08476603  0.1871551   0.07055222  0.33480474]
 [-0.43008631  0.08123194 -0.06107263 -0.02015138 -0.39423389]
 [ 0.13310965  0.1772054  -0.1790735   0.36107434 -0.43959196]]
for word, index in word_to_index.items():
    word_embedding_vector = W2.T[:, index]
    print(f'{word}:    \t{word_embedding_vector}')
am:     [-0.22182064 -0.43008631  0.13310965]
because:        [0.08476603 0.08123194 0.1772054 ]
happy:          [ 0.1871551  -0.06107263 -0.1790735 ]
i:      [ 0.07055222 -0.02015138  0.36107434]
learning:       [ 0.33480474 -0.39423389 -0.43959196]

Option 3: extract embedding vectors from \(\mathbf{W_1}\) and \(\mathbf{W_2}\)

The third option, which is the one you will use in this week's assignment, uses the average of \(\mathbf{W_1}\) and \(\mathbf{W_2^\intercal}\).

Calculate the average of \(\mathbf{W_1}\) and \(\mathbf{W_2^\intercal}\), and store the result in W3.

W3 = (W1 + W2.T)/2
print(W3)

expected = numpy.array([
    [ 0.09752647,  0.08665397, -0.02389858,  0.1768788 ,  0.3764029 ],
    [-0.05136565,  0.15459171, -0.15029611,  0.19580601, -0.31673866],
    [ 0.19974284, -0.03063173, -0.27839106,  0.12353994, -0.04975536]])
expect(numpy.allclose(W3, expected)).to(be_true)
[[ 0.09752647  0.08665397 -0.02389858  0.1768788   0.3764029 ]
 [-0.05136565  0.15459171 -0.15029611  0.19580601 -0.31673866]
 [ 0.19974284 -0.03063173 -0.27839106  0.12353994 -0.04975536]]

Extracting the word embedding vectors works just like the two previous options, by taking the columns of the matrix you've just created.

for word, index in word_to_index.items():
    word_embedding_vector = W3[:, index]
    print(f'{word}:    \t{word_embedding_vector}')
am:     [ 0.09752647 -0.05136565  0.19974284]
because:        [ 0.08665397  0.15459171 -0.03063173]
happy:          [-0.02389858 -0.15029611 -0.27839106]
i:      [0.1768788  0.19580601 0.12353994]
learning:       [ 0.3764029  -0.31673866 -0.04975536]

Now you know 3 different options to get the word embedding vectors from a model.

End

Now we've gone through the process of training a CBOW model in order to create word embeddings. The steps were:

  • Preparing and cleaning the data
  • Setting up the CBOW model and its activation functions (ReLU and softmax)
  • Training with forward propagation, cross-entropy loss, backpropagation, and gradient descent
  • Extracting the word embedding vectors from the trained weights

Training the CBOW Model

Beginning

Previously we looked at preparing the data and how to set up the CBOW Model, now we'll look at training the model.

Imports

# python
import math

# from pypi
from expects import (
    be_true,
    equal,
    expect,
)
import numpy

Functions from Previous Posts

Data Preparation Functions

These were previously defined in Word Embeddings: Data Preparation post.

def window_generator(words: list, half_window: int):
    """Generates windows of words

    Args:
     words: cleaned tokens
     half_window: number of words in the half-window

    Yields:
     the next window
    """
    for center_index in range(half_window, len(words) - half_window):
        center_word = words[center_index]
        context_words = (words[(center_index - half_window) : center_index]
                         + words[(center_index + 1):(center_index + half_window + 1)])
        yield context_words, center_word
    return
def index_word_maps(data: list) -> tuple:
    """Creates index to word mappings

    The index is based on sorted unique tokens in the data

    Args:
       data: the data you want to pull from

    Returns:
       word_to_index: dictionary mapping the word to its index
       index_to_word: dictionary mapping the index to its word
    """
    words = sorted(list(set(data)))

    word_to_index = {word: index for index, word in enumerate(words)}
    index_to_word = {index: word for index, word in enumerate(words)}
    return word_to_index, index_to_word
def word_to_one_hot_vector(word: str, word_to_index: dict, vocabulary_size: int) -> numpy.ndarray:
    """Create a one-hot-encoded vector

    Args:
     word: the word from the corpus that we're encoding
     word_to_index: map of the word to the index
     vocabulary_size: the size of the vocabulary

    Returns:
     vector with all zeros except where the word is
    """
    one_hot_vector = numpy.zeros(vocabulary_size)
    one_hot_vector[word_to_index[word]] = 1
    return one_hot_vector
ROWS = 0
def context_words_to_vector(context_words: list,
                            word_to_index: dict) -> numpy.ndarray:
    """Create vector with the mean of the one-hot-vectors

    Args:
     context_words: words to convert to one-hot vectors
     word_to_index: dict mapping word to index
    """
    vocabulary_size = len(word_to_index)
    context_words_vectors = [
        word_to_one_hot_vector(word, word_to_index, vocabulary_size)
        for word in context_words]
    return numpy.mean(context_words_vectors, axis=ROWS)
def training_example_generator(words: list, half_window: int, word_to_index: dict):
    """generates training examples

    Args:
     words: source of words
     half_window: half the window size
     word_to_index: dict with word to index mapping
    """
    vocabulary_size = len(word_to_index)
    for context_words, center_word in window_generator(words, half_window):
        yield (context_words_to_vector(context_words, word_to_index),
               word_to_one_hot_vector(
                   center_word, word_to_index, vocabulary_size))
    return

Activation Functions

These functions were defined in the Introducing the CBOW Model post.

def relu(z: numpy.ndarray) -> numpy.ndarray:
    """Get the ReLU for the input array

    Args:
     z: an array of numbers

    Returns:
     ReLU of z
    """
    result = z.copy()
    result[result < 0] = 0
    return result
def softmax(z: numpy.ndarray) -> numpy.ndarray:
    """Calculate Softmax for the input

    Args:
     z: array of values

    Returns:
     array of probabilities
    """
    e_z = numpy.exp(z)
    sum_e_z = numpy.sum(e_z)
    return e_z / sum_e_z

Word Embeddings: Training the CBOW model

In previous lecture notebooks you saw how to prepare data before feeding it to a continuous bag-of-words model, as well as the model itself, its architecture, and its activation functions. This notebook will walk you through:

  • Forward propagation.
  • Cross-entropy loss.
  • Backpropagation.
  • Gradient descent.

These are the concepts necessary to understand how the training of the model works.

Neural Network Initialization

Let's dive into the neural network itself, which is shown below with all the dimensions and formulas you'll need.

Set N equal to 3. Remember that N is a hyperparameter of the CBOW model that represents the size of the word embedding vectors, as well as the size of the hidden layer.

Also set V equal to 5, which is the size of the vocabulary we have used so far.

# Define the size of the word embedding vectors and save it in the variable 'N'
N = 3

# Define V. Remember this was the size of the vocabulary in the previous lecture notebooks
V = 5

Initialization of the weights and biases

Before you start training the neural network, you need to initialize the weight matrices and bias vectors with random values.

In the assignment you will implement a function to do this yourself using numpy.random.rand. In this notebook, we've pre-populated these matrices and vectors for you.

Define the first matrix of weights

W1 = numpy.array([
    [ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
    [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
    [ 0.26637602, -0.23846886, -0.37770863, -0.11399446,  0.34008124]])

Define the second matrix of weights

W2 = numpy.array([[-0.22182064, -0.43008631,  0.13310965],
                  [ 0.08476603,  0.08123194,  0.1772054 ],
                  [ 0.1871551 , -0.06107263, -0.1790735 ],
                  [ 0.07055222, -0.02015138,  0.36107434],
                  [ 0.33480474, -0.39423389, -0.43959196]])

Define the first vector of biases

b1 = numpy.array([[ 0.09688219],
                  [ 0.29239497],
                  [-0.27364426]])

Define the second vector of biases

b2 = numpy.array([[ 0.0352008 ],
                  [-0.36393384],
                  [-0.12775555],
                  [-0.34802326],
                  [-0.07017815]])

Check that the dimensions of these matrices are correct.

print(f'V (vocabulary size): {V}')
print(f'N (embedding size / size of the hidden layer): {N}')

print(f'size of W1: {W1.shape} (NxV)')
print(f'size of b1: {b1.shape} (Nx1)')
print(f'size of W2: {W2.shape} (VxN)')
print(f'size of b2: {b2.shape} (Vx1)')

expect(W1.shape).to(equal((N, V)))
expect(b1.shape).to(equal((N, 1)))
expect(W2.shape).to(equal((V, N)))
expect(b2.shape).to(equal((V, 1)))
V (vocabulary size): 5
N (embedding size / size of the hidden layer): 3
size of W1: (3, 5) (NxV)
size of b1: (3, 1) (Nx1)
size of W2: (5, 3) (VxN)
size of b2: (5, 1) (Vx1)

Before moving forward, you will need some functions and variables defined in previous notebooks. They can be found next. Be sure you understand everything that is going on in the next cell; if not, consider reviewing the first lecture notebook.

Define the tokenized version of the corpus

words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']

Get the word_to_index and index_to_word dictionaries for the tokenized corpus.

word_to_index, index_to_word = index_word_maps(words)

The First Training Example

Run the next cells to get the first training example, made of the vector representing the context words "i am because i", and the target which is the one-hot vector representing the center word "happy".

training_examples = training_example_generator(words, 2, word_to_index)
x_array, y_array = next(training_examples)

In this notebook, next is used because you will only be performing one iteration of training. In this week's assignment, with the full training over several iterations, you'll use regular for loops with the iterator that supplies the training examples.
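
For reference, a full training loop over this generator would look roughly like the following sketch; num_iterations and the loop body are placeholders, not the assignment's code.

num_iterations = 3  # placeholder value
for iteration, (x_example, y_example) in enumerate(
        training_example_generator(words, 2, word_to_index)):
    if iteration == num_iterations:
        break
    # the forward pass, loss, backpropagation, and weight updates would go here
    print(iteration, x_example.shape, y_example.shape)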

The vector representing the context words, which will be fed into the neural network, is:

print(x_array)
[0.25 0.25 0.   0.5  0.  ]

The one-hot vector representing the center word to be predicted is:

print(y_array)
[0. 0. 1. 0. 0.]

Now convert these vectors into matrices (or 2D arrays) to be able to perform matrix multiplication on the right types of objects, as explained in a previous notebook.

# Copy vector
x = x_array.copy()

# Reshape it
x.shape = (V, 1)

# Print it
print(f'x:\n{x}\n')

# Copy vector
y = y_array.copy()

# Reshape it
y.shape = (V, 1)

# Print it
print(f'y:\n{y}')
x:
[[0.25]
 [0.25]
 [0.  ]
 [0.5 ]
 [0.  ]]

y:
[[0.]
 [0.]
 [1.]
 [0.]
 [0.]]

Forward Propagation

The Hidden Layer

Now that you have initialized all the variables that you need for forward propagation, you can calculate the values of the hidden layer using the following formulas:

\begin{align} \mathbf{z_1} = \mathbf{W_1}\mathbf{x} + \mathbf{b_1} \tag{1} \\ \mathbf{h} = \mathrm{ReLU}(\mathbf{z_1}) \tag{2} \\ \end{align}

First, you can calculate the value of \(\mathbf{z_1}\).

Compute z1 (values of first hidden layer before applying the ReLU function)

z1 = numpy.dot(W1, x) + b1

As expected you get an \(N\) by 1 matrix, or column vector with N elements, where N is equal to the embedding size, which is 3 in this example.

print(z1)
[[ 0.36483875]
 [ 0.63710329]
 [-0.3236647 ]]

You can now take the ReLU of \(\mathbf{z_1}\) to get \(\mathbf{h}\), the vector with the values of the hidden layer.

Compute h (z1 after applying ReLU function)

h = relu(z1)
print(h)
[[0.36483875]
 [0.63710329]
 [0.        ]]

Applying ReLU means that the negative element of \(\mathbf{z_1}\) has been replaced with a zero.

The Output Layer

Here are the formulas you need to calculate the values of the output layer, represented by the vector \(\mathbf{\hat y}\):

\begin{align} \mathbf{z_2} &= \mathbf{W_2}\mathbf{h} + \mathbf{b_2} \tag{3} \\ \mathbf{\hat y} &= \mathrm{softmax}(\mathbf{z_2}) \tag{4} \\ \end{align}

First, calculate \(\mathbf{z_2}\).

Compute z2 (values of the output layer before applying the softmax function)

z2 = numpy.dot(W2, h) + b2
print(z2)
expected = numpy.array([
    [-0.31973737],
    [-0.28125477],
    [-0.09838369],
    [-0.33512159],
    [-0.19919612]])
expect(numpy.allclose(z2, expected)).to(be_true)
[[-0.31973737]
 [-0.28125477]
 [-0.09838369]
 [-0.33512159]
 [-0.19919612]]

This is a V by 1 matrix, where V is the size of the vocabulary, which is 5 in this example.

Now calculate the value of \(\mathbf{\hat y}\).

Compute y_hat (z2 after applying softmax function)

y_hat = softmax(z2)
print(y_hat)
expected = numpy.array([
    [0.18519074],
    [0.19245626],
    [0.23107446],
    [0.18236353],
    [0.20891502]])
expect(numpy.allclose(expected, y_hat)).to(be_true)
[[0.18519074]
 [0.19245626]
 [0.23107446]
 [0.18236353]
 [0.20891502]]

As you've performed the calculations with random matrices and vectors (apart from the input vector), the output of the neural network is essentially random at this point. The learning process will adjust the weights and biases to match the actual targets better.

That being said, what word did the neural network predict?

prediction = numpy.argmax(y_hat)
print(f"The predicted word at index {prediction} is '{index_to_word[prediction]}'.")
The predicted word at index 2 is 'happy'.

The neural network predicted the word "happy": the largest element of \(\mathbf{\hat y}\) is the third one, and the third word of the vocabulary is "happy".

Cross-Entropy Loss

Now that you have the network's prediction, you can calculate the cross-entropy loss to determine how accurate the prediction was compared to the actual target.

Remember that you are working on a single training example, not on a batch of examples, which is why you are using the loss and not the cost, the generalized form of the loss over a batch.
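
To make the distinction concrete (my summary, not a formula from the notebook): for a batch of \(m\) examples the cost is the mean of the per-example losses,

\[ J_{batch} = \frac{1}{m}\sum\limits_{i=1}^{m} J^{(i)} \]

which lines up with the division by batch_size in the trainer's cross_entropy_loss earlier in this document.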

First let's recall what the prediction was.

print(y_hat)
[[0.18519074]
 [0.19245626]
 [0.23107446]
 [0.18236353]
 [0.20891502]]

And the actual target value is:

print(y)
[[0.]
 [0.]
 [1.]
 [0.]
 [0.]]

The formula for cross-entropy loss is:

\[ J=-\sum\limits_{k=1}^{V}y_k\log{\hat{y}_k} \tag{6} \]

Try implementing the cross-entropy loss function so you get more familiar working with numpy.

def cross_entropy_loss(y_predicted: numpy.ndarray,
                       y_actual: numpy.ndarray) -> numpy.ndarray:
    """Calculate cross-entropy loss  for the prediction

    Args:
     y_predicted: what our model predicted
     y_actual: the known labels

    Returns:
     cross-entropy loss for y_predicted
    """
    loss = -numpy.sum(y_actual * numpy.log(y_predicted))
    return loss

Hint 1:

To multiply two numpy matrices (such as y and y_hat) element-wise, you can simply use the * operator.

Hint 2:

Once you have a vector equal to the element-wise multiplication of y and y_hat, you can use numpy.sum to calculate the sum of the elements of this vector.
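
Put together, the two hints amount to something like this quick demonstration (toy arrays with arbitrary values, not the y and y_hat above):

import numpy

toy_actual = numpy.array([[0.], [1.], [0.]])
toy_predicted = numpy.array([[0.2], [0.5], [0.3]])
elementwise = toy_actual * numpy.log(toy_predicted)  # Hint 1: element-wise product
toy_loss = -numpy.sum(elementwise)                   # Hint 2: sum (and negate)
print(toy_loss)  # -log(0.5), roughly 0.693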

Now use this function to calculate the loss with the actual values of \(\mathbf{y}\) and \(\mathbf{\hat y}\).

loss = cross_entropy_loss(y_hat, y)
print(f"{loss:0.3f}")
expected = 1.4650152923611106
expect(math.isclose(loss, expected)).to(be_true)
1.465

This value is neither good nor bad, which is expected as the neural network hasn't learned anything yet.

The actual learning will start during the next phase: backpropagation.

Backpropagation

The formulas that you will implement for backpropagation are the following.

\begin{align} \frac{\partial J}{\partial \mathbf{W_1}} &= \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right )\mathbf{x}^\top \tag{7}\\ \frac{\partial J}{\partial \mathbf{W_2}} &= (\mathbf{\hat{y}} - \mathbf{y})\mathbf{h^\top} \tag{8}\\ \frac{\partial J}{\partial \mathbf{b_1}} &= \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right ) \tag{9}\\ \frac{\partial J}{\partial \mathbf{b_2}} &= \mathbf{\hat{y}} - \mathbf{y} \tag{10} \end{align}

Note: these formulas are slightly simplified compared to the ones in the lecture as you're working on a single training example, whereas the lecture provided the formulas for a batch of examples. In the assignment you'll be implementing the latter.

Let's start with an easy one.

Calculate the partial derivative of the loss function with respect to \(\mathbf{b_2}\), and store the result in grad_b2.

\[ \frac{\partial J}{\partial \mathbf{b_2}} = \mathbf{\hat{y}} - \mathbf{y} \tag{10} \]

Compute vector with partial derivatives of loss function with respect to b2

grad_b2 = y_hat - y
print(grad_b2)
expected = numpy.array([
    [ 0.18519074],
    [ 0.19245626],
    [-0.76892554],
    [ 0.18236353],
    [ 0.20891502]])
expect(numpy.allclose(grad_b2, expected)).to(be_true)
[[ 0.18519074]
 [ 0.19245626]
 [-0.76892554]
 [ 0.18236353]
 [ 0.20891502]]

Next, calculate the partial derivative of the loss function with respect to \(\mathbf{W_2}\), and store the result in grad_W2.

\[ \frac{\partial J}{\partial \mathbf{W_2}} = (\mathbf{\hat{y}} - \mathbf{y})\mathbf{h^\top} \tag{8} \]

Hint: use .T to get a transposed matrix, e.g. h.T returns \(\mathbf{h^\top}\).

Compute matrix with partial derivatives of loss function with respect to W2.

grad_W2 = numpy.dot(y_hat - y, h.T)
print(grad_W2)
expected = numpy.array([
    [0.06756476,  0.11798563,  0.        ],
    [ 0.0702155 ,  0.12261452,  0.        ],
    [-0.28053384, -0.48988499,  0.        ],
    [ 0.06653328,  0.1161844 ,  0.        ],
    [ 0.07622029,  0.13310045,  0.        ]])

expect(numpy.allclose(grad_W2, expected)).to(be_true)
[[ 0.06756476  0.11798563  0.        ]
 [ 0.0702155   0.12261452  0.        ]
 [-0.28053384 -0.48988499  0.        ]
 [ 0.06653328  0.1161844   0.        ]
 [ 0.07622029  0.13310045  0.        ]]

Now calculate the partial derivative with respect to \(\mathbf{b_1}\) and store the result in grad_b1.

\[ \frac{\partial J}{\partial \mathbf{b_1}} = \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right ) \tag{9} \]

Compute vector with partial derivatives of loss function with respect to b1.

grad_b1 = relu(numpy.dot(W2.T, y_hat - y))
print(grad_b1)
expected = numpy.array([
    [0.        ],
    [0.        ],
    [0.17045858]])
expect(numpy.allclose(grad_b1, expected)).to(be_true)
[[0.        ]
 [0.        ]
 [0.17045858]]

Finally, calculate the partial derivative of the loss with respect to \(\mathbf{W_1}\), and store it in grad_W1.

\[ \frac{\partial J}{\partial \mathbf{W_1}} = \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right )\mathbf{x}^\top \tag{7} \]

Compute matrix with partial derivatives of loss function with respect to W1.

grad_W1 = numpy.dot(relu(numpy.dot(W2.T, y_hat - y)), x.T)
print(grad_W1)
expected = numpy.array([
    [0.        , 0.        , 0.        , 0.        , 0.        ],
    [0.        , 0.        , 0.        , 0.        , 0.        ],
    [0.04261464, 0.04261464, 0.        , 0.08522929, 0.        ]])

expect(numpy.allclose(grad_W1, expected)).to(be_true)
[[0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.04261464 0.04261464 0.         0.08522929 0.        ]]

Before moving on to gradient descent, double-check that all the matrices have the expected dimensions.

print(f'V (vocabulary size): {V}')
print(f'N (embedding size / size of the hidden layer): {N}')
print(f'size of grad_W1: {grad_W1.shape} (NxV)')
print(f'size of grad_b1: {grad_b1.shape} (Nx1)')
print(f'size of grad_W2: {grad_W2.shape} (VxN)')
print(f'size of grad_b2: {grad_b2.shape} (Vx1)')

expect(grad_W1.shape).to(equal((N, V)))
expect(grad_b1.shape).to(equal((N, 1)))
expect(grad_W2.shape).to(equal((V, N)))
expect(grad_b2.shape).to(equal((V, 1)))
V (vocabulary size): 5
N (embedding size / size of the hidden layer): 3
size of grad_W1: (3, 5) (NxV)
size of grad_b1: (3, 1) (Nx1)
size of grad_W2: (5, 3) (VxN)
size of grad_b2: (5, 1) (Vx1)

Gradient descent

During the gradient descent phase, you will update the weights and biases by subtracting \(\alpha\) times the gradient from the original matrices and vectors, using the following formulas.

\begin{align} \mathbf{W_1} &\gets \mathbf{W_1} - \alpha \frac{\partial J}{\partial \mathbf{W_1}} \tag{11}\\ \mathbf{W_2} &\gets \mathbf{W_2} - \alpha \frac{\partial J}{\partial \mathbf{W_2}} \tag{12}\\ \mathbf{b_1} &\gets \mathbf{b_1} - \alpha \frac{\partial J}{\partial \mathbf{b_1}} \tag{13}\\ \mathbf{b_2} &\gets \mathbf{b_2} - \alpha \frac{\partial J}{\partial \mathbf{b_2}} \tag{14}\\ \end{align}

First, let's set a value for \(\alpha\).

alpha = 0.03

The updated weight matrix \(\mathbf{W_1}\) will be:

W1_new = W1 - alpha * grad_W1

Let's compare the previous and new values of \(\mathbf{W_1}\):

print('old value of W1:')
print(W1)
print()
print('new value of W1:')
print(W1_new)
old value of W1:
[[ 0.41687358  0.08854191 -0.23495225  0.28320538  0.41800106]
 [ 0.32735501  0.22795148 -0.23951958  0.4117634  -0.23924344]
 [ 0.26637602 -0.23846886 -0.37770863 -0.11399446  0.34008124]]

new value of W1:
[[ 0.41687358  0.08854191 -0.23495225  0.28320538  0.41800106]
 [ 0.32735501  0.22795148 -0.23951958  0.4117634  -0.23924344]
 [ 0.26509758 -0.2397473  -0.37770863 -0.11655134  0.34008124]]

The difference is very subtle (hint: take a closer look at the last row), which is why it takes a fair amount of iterations to train the neural network until it reaches optimal weights and biases starting from random values.

Now calculate the new values of \(\mathbf{W_2}\) (to be stored in W2_new), \(\mathbf{b_1}\) (in b1_new), and \(\mathbf{b_2}\) (in b2_new).

\begin{align} \mathbf{W_2} &\gets \mathbf{W_2} - \alpha \frac{\partial J}{\partial \mathbf{W_2}} \tag{12}\\ \mathbf{b_1} &\gets \mathbf{b_1} - \alpha \frac{\partial J}{\partial \mathbf{b_1}} \tag{13}\\ \mathbf{b_2} &\gets \mathbf{b_2} - \alpha \frac{\partial J}{\partial \mathbf{b_2}} \tag{14}\\ \end{align}

Compute updated W2.

W2_new = W2 - alpha * grad_W2

Compute updated b1.

b1_new = b1 - alpha * grad_b1

Compute updated b2.

b2_new = b2 - alpha * grad_b2
print('W2_new')
print(W2_new)
print()
print('b1_new')
print(b1_new)
print()
print('b2_new')
print(b2_new)

w2_expected = numpy.array(
   [[-0.22384758, -0.43362588,  0.13310965],
    [ 0.08265956,  0.0775535 ,  0.1772054 ],
    [ 0.19557112, -0.04637608, -0.1790735 ],
    [ 0.06855622, -0.02363691,  0.36107434],
    [ 0.33251813, -0.3982269 , -0.43959196]])

b1_expected = numpy.array(
   [[ 0.09688219],
    [ 0.29239497],
    [-0.27875802]])

b2_expected = numpy.array(
   [[ 0.02964508],
    [-0.36970753],
    [-0.10468778],
    [-0.35349417],
    [-0.0764456 ]]
)

for actual, expected in zip((W2_new, b1_new, b2_new), (w2_expected, b1_expected, b2_expected)):
    expect(numpy.allclose(actual, expected)).to(be_true)
W2_new
[[-0.22384758 -0.43362588  0.13310965]
 [ 0.08265956  0.0775535   0.1772054 ]
 [ 0.19557112 -0.04637608 -0.1790735 ]
 [ 0.06855622 -0.02363691  0.36107434]
 [ 0.33251813 -0.3982269  -0.43959196]]

b1_new
[[ 0.09688219]
 [ 0.29239497]
 [-0.27875802]]

b2_new
[[ 0.02964508]
 [-0.36970753]
 [-0.10468778]
 [-0.35349417]
 [-0.0764456 ]]

Congratulations, you have completed one iteration of training using one training example!

You'll need many more iterations to fully train the neural network, and you can optimize the learning process by training on batches of examples, as described in the lecture. You will get to do this during this week's assignment.

End

Now that we know how to train the CBOW Model, we'll move on to extracting word embeddings from the model. This is part of a series of posts looking at some preliminaries for creating word-embeddings. There is a table-of-contents post here.

Introducing the CBOW Model

The Continuous Bag-Of-Words (CBOW) Model

In the previous post we prepared our data, now we'll look at how the CBOW model is constructed.

Imports

# from pypi
from expects import (
    be_true,
    equal,
    expect,
)
import numpy

Activation Functions

Let's start by implementing the activation functions, ReLU and softmax.

ReLU

ReLU is used to calculate the values of the hidden layer, in the following formulas:

\begin{align} \mathbf{z_1} &= \mathbf{W_1}\mathbf{x} + \mathbf{b_1} \tag{1} \\ \mathbf{h} &= \mathrm{ReLU}(\mathbf{z_1}) \tag{2} \\ \end{align}

Let's fix a value for \(\mathbf{z_1}\) as a working example.

numpy.random.seed(10)

# Define a 5X1 column vector using numpy
z_1 = 10 * numpy.random.rand(5, 1) - 5

# Print the vector
print(z_1)
[[ 2.71320643]
 [-4.79248051]
 [ 1.33648235]
 [ 2.48803883]
 [-0.01492988]]

Notice that numpy's random.rand function returns a numpy array filled with values drawn from a uniform distribution over [0, 1). Since numpy operations are vectorized, each value is multiplied by 10 and then 5 is subtracted from it.
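
As an aside (my addition, not in the original notebook), the newer Generator API would give the same kind of uniform sample, though the stream for a given seed differs from the legacy functions:

rng = numpy.random.default_rng(10)
z_1_new = 10 * rng.random((5, 1)) - 5  # also uniform, now over [-5, 5)
print(z_1_new)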

To get the ReLU of this vector, you want all the negative values to become zeros.

First create a copy of this vector.

h = z_1.copy()

Now determine which of its values are negative.

print(h < 0)
[[False]
 [ True]
 [False]
 [False]
 [ True]]

You can now simply set all of the values which are negative to 0.

h[h < 0] = 0

And that's it: you have the ReLU of \(\mathbf{z_1}\).

print(h)
[[2.71320643]
 [0.        ]
 [1.33648235]
 [2.48803883]
 [0.        ]]

Now implement ReLU as a function.

def relu(z: numpy.ndarray) -> numpy.ndarray:
    """Get the ReLU for the input array

    Args:
     z: an array of numbers

    Returns:
     ReLU of z
    """
    result = z.copy()
    result[result < 0] = 0
    return result

And check that it's working.

z = numpy.array([[-1.25459881],
                 [ 4.50714306],
                 [ 2.31993942],
                 [ 0.98658484],
                 [-3.4398136 ]])

# Apply ReLU to it
actual = relu(z)
expected = numpy.array([[0.        ],
                        [4.50714306],
                        [2.31993942],
                        [0.98658484],
                        [0.        ]])

print(actual)

expect(numpy.allclose(actual, expected)).to(be_true)
[[0.        ]
 [4.50714306]
 [2.31993942]
 [0.98658484]
 [0.        ]]
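
For what it's worth, an equivalent (and arguably more idiomatic) way to compute ReLU is numpy.maximum, which clips element-wise without the explicit copy-and-mask; this alternative isn't used in the rest of the post.

expect(numpy.allclose(numpy.maximum(z, 0), relu(z))).to(be_true)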

SoftMax

The second activation function that you need is softmax. This function is used to calculate the values of the output layer of the neural network, using the following formulas:

\begin{align} \mathbf{z_2} &= \mathbf{W_2}\mathbf{h} + \mathbf{b_2} \tag{3} \\ \mathbf{\hat y} &= \mathrm{softmax}(\mathbf{z_2}) \tag{4} \\ \end{align}

To calculate softmax of a vector \(\mathbf{z}\), the i-th component of the resulting vector is given by:

\[ \textrm{softmax}(\textbf{z})_i = \frac{e^{z_i} }{\sum\limits_{j=1}^{V} e^{z_j} } \tag{5} \]

Let's work through an example.

z = numpy.array([9, 8, 11, 10, 8.5])
print(z)
[ 9.   8.  11.  10.   8.5]

You'll need to calculate the exponentials of each element, both for the numerator and for the denominator.

e_z = numpy.exp(z)

print(e_z)
[ 8103.08392758  2980.95798704 59874.1417152  22026.46579481
  4914.7688403 ]

The denominator is equal to the sum of these exponentials.

sum_e_z = numpy.sum(e_z)
print(f"{sum_e_z:,.2f}")
97,899.42

And the value of the first element of \(\textrm{softmax}(\textbf{z})\) is given by:

print(f"{e_z[0]/sum_e_z:0.4f}")
0.0828

This is for one element. You can use numpy's vectorized operations to calculate the values of all the elements of the \(\textrm{softmax}(\textbf{z})\) vector in one go.
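
Using the e_z and sum_e_z already computed above, the division broadcasts over the whole array, reproducing the per-element calculation in a single expression.

# the first entry should match the 0.0828 computed above
print(e_z / sum_e_z)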

Implement the softmax function.

def softmax(z: numpy.ndarray) -> numpy.ndarray:
    """Calculate Softmax for the input

    Args:
     z: array of values

    Returns:
     array of probabilities
    """
    e_z = numpy.exp(z)
    sum_e_z = numpy.sum(e_z)
    return e_z / sum_e_z

Now check that it works.

actual = softmax([9, 8, 11, 10, 8.5])
print(actual)
expected = numpy.array([0.08276948,
                        0.03044919,
                        0.61158833,
                        0.22499077,
                        0.05020223])

expect(numpy.allclose(actual, expected)).to(be_true)
[0.08276948 0.03044919 0.61158833 0.22499077 0.05020223]

Notice that the sum of all these values is equal to 1.

expect(numpy.sum(softmax([9, 8, 11, 10, 8.5]))).to(equal(1))
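
One caveat the post doesn't cover: numpy.exp overflows for sufficiently large inputs, so a common variant subtracts the maximum element first, which leaves the result mathematically unchanged. A minimal sketch (not part of the original):

def softmax_stable(z: numpy.ndarray) -> numpy.ndarray:
    """Softmax with the usual max-subtraction trick for numerical stability

    Args:
     z: array of values

    Returns:
     array of probabilities
    """
    z = numpy.asarray(z, dtype=float)
    e_z = numpy.exp(z - numpy.max(z))  # shifting by a constant doesn't change the ratios
    return e_z / numpy.sum(e_z)

expect(numpy.allclose(softmax_stable([9, 8, 11, 10, 8.5]),
                      softmax([9, 8, 11, 10, 8.5]))).to(be_true)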

Dimensions: 1-D arrays vs 2-D column vectors

Before moving on to implement forward propagation, backpropagation, and gradient descent in the next lecture notebook, let's have a look at the dimensions of the vectors you've been handling until now.

Create a vector of length V filled with zeros.

Define V. Remember this was the size of the vocabulary in the previous lecture notebook.

V = 5

Define vector of length V filled with zeros

x_array = numpy.zeros(V)
print(x_array)
[0. 0. 0. 0. 0.]

This is a 1-dimensional array, as revealed by the .shape property of the array.

print(x_array.shape)
(5,)

To perform matrix multiplication in the next steps, you actually need your column vectors to be represented as a matrix with one column. In numpy, this matrix is represented as a 2-dimensional array.

The easiest way to convert a 1D vector to a 2D column matrix is to set its `.shape` property to the number of rows and one column, as shown in the next cell.

# Copy vector
x_column_vector = x_array.copy()

# Reshape copy of vector
x_column_vector.shape = (V, 1)  # alternatively ... = (x_array.shape[0], 1)

# Print vector
print(x_column_vector)
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]]

The shape of the resulting "vector" is:

print(x_column_vector.shape)
(5, 1)
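
Mutating .shape in place works; numpy also offers reshape and numpy.newaxis, which produce the same (V, 1) result without the copy-and-assign step (these alternatives aren't used later in the post).

column_via_reshape = x_array.reshape(-1, 1)      # -1 lets numpy infer the number of rows
column_via_newaxis = x_array[:, numpy.newaxis]   # add a trailing axis of length 1

print(column_via_reshape.shape)
print(column_via_newaxis.shape)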

End

Now that we have the basics of the model we can move on to training the model.

Word Embeddings: Data Preparation

Data Preparation

This is a look at the types of things we're going to need to do in order to convert our twitter data into a form that we can use for machine learning.

Python Imports

This is pretty standard stuff. Only the re module for regular expressions is, strictly speaking, needed for the processing itself; pprint and typing are just here for display and type hints.

from functools import partial
from pprint import pprint
from typing import Dict, Generator, List, Tuple

import re

PyPi Imports

These are libraries installed via the Python Package Index (pypi).

  • emoji: Convert emoji to text aliases and vice-versa
  • nltk: The Natural Language Toolkit
  • numpy: We'll use it for its arrays.

import emoji
import nltk
import numpy

And Now the Processing

The basic problem we have is that our data is made up of tweets (strings) with emojis, punctuation, and so on but we need to convert it to something numeric. The first step in that process is to standardize the strings and break them up into tokens.

Our Corpus

Here's a fake tweet that we can use to work our way through the steps.

corpus = emoji.emojize((
    "My :ogre: :red_heart: :moyai:, but "
    "my :goblin: must :poop:! Ho, :zany_face:. Human-animals $??"),
                       use_aliases=True)
print(corpus)

My πŸ‘Ή ❀ πŸ—Ώ, but my πŸ‘Ί must πŸ’©! Ho, πŸ€ͺ. Human-animals $??

Cleaning and Tokenization

Punctuation

The first thing we're going to do is replace the punctuation with periods (.) using re.sub.

ONE_OR_MORE = "+"
PERIOD = "."
PUNCTUATION = ",!?;-"
EXPRESSION = f"[{PUNCTUATION}]" + ONE_OR_MORE
data = re.sub(EXPRESSION, PERIOD, corpus)

print(f"First cleaning:  '{data}'")

First cleaning: 'My πŸ‘Ή ❀ πŸ—Ώ. but my πŸ‘Ί must πŸ’©. Ho. πŸ€ͺ. Human.animals $.'

Tokenize

Next, use NLTK's punkt word_tokenize to break our corpus into tokens. punkt is German for "period" and is the name of a system created by Tibor Kiss and Jan Strunk. There's a link to the original paper ("Unsupervised Multilingual Sentence Boundary Detection") on this page.

print(f" - Before: {data}")
data = nltk.word_tokenize(data)
print(f" - After tokenization:  {data}")
  • Before: My πŸ‘Ή ❀ πŸ—Ώ. but my πŸ‘Ί must πŸ’©. Ho. πŸ€ͺ. Human.animals $.
  • After tokenization: ['My', 'πŸ‘Ή', '❀', 'πŸ—Ώ', '.', 'but', 'my', 'πŸ‘Ί', 'must', 'πŸ’©', '.', 'Ho', '.', 'πŸ€ͺ', '.', 'Human.animals', '$', '.']

Lower-Case And More Cleaning

Now we'll reduce the tokens a little more:

  • lower-case everything
  • filter out everything but letters, periods, and emoji

We're going to use emoji's get_emoji_regexp, which returns a compiled Python regular expression object that matches emoji, so we can call its search method to see whether a token contains an emoji.

print(f" - Before:  {data}")

emoji_expression = emoji.get_emoji_regexp()
data = [token.lower() for token in data
        if any((token.isalpha(),
                token == '.',
                emoji_expression.search(token)))]

print(f" - After:  {data}")
  • Before: ['My', 'πŸ‘Ή', '❀', 'πŸ—Ώ', '.', 'but', 'my', 'πŸ‘Ί', 'must', 'πŸ’©', '.', 'Ho', '.', 'πŸ€ͺ', '.', 'Human.animals', '$', '.']
  • After: ['my', 'πŸ‘Ή', '❀', 'πŸ—Ώ', '.', 'but', 'my', 'πŸ‘Ί', 'must', 'πŸ’©', '.', 'ho', '.', 'πŸ€ͺ', '.', '.']

One thing to notice is that it got rid of Human.animals entirely: it isn't alphabetic, isn't a period, and contains no emoji, so the filter drops it, which means any hyphenated words are going to be eliminated (one possible workaround is sketched below).
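
If you wanted to keep hyphenated words, one option (not part of the pipeline used here) is to turn hyphens into spaces before collapsing the rest of the punctuation, so that word_tokenize sees the two halves as separate tokens. A quick sketch:

# hypothetical variant: split on hyphens instead of dropping hyphenated words
hyphens_to_spaces = re.sub("-", " ", corpus)
hyphens_to_spaces = re.sub("[,!?;]+", ".", hyphens_to_spaces)
print(nltk.word_tokenize(hyphens_to_spaces))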

Wrap It Together

While the separate steps were useful for illustration, it'll be more convenient to put them into a single function for later use.

def tokenize(corpus: str) -> list:
    """clean and tokenize the corpus

    Args:
     corpus: original source text

    Returns:
     list of cleaned tokens from the corpus
    """
    ONE_OR_MORE = "+"
    PUNCTUATION = ",!?;-"
    EXPRESSION = f"[{PUNCTUATION}]" + ONE_OR_MORE
    PERIOD = "."

    data = re.sub(EXPRESSION, PERIOD, corpus)
    expression = emoji.get_emoji_regexp()
    data = nltk.word_tokenize(data)
    data = [token.lower() for token in data
            if any((token.isalpha(),
                    token == '.',
                    expression.search(token)))]
    return data

Now we can test it out.

corpus = emoji.emojize(
    ("Able was :clown_face:; ere :clown_face: saw :frog_face:! "
    "Rejoice! :cigarette: for $9?"), use_aliases=True)

# Print new corpus
print(f" - Corpus:  {corpus}")

# Save tokenized version of corpus into 'words' variable
words = tokenize(corpus)

# Print the tokenized version of the corpus
print(f" - Words (tokens):  {words}")
  • Corpus: Able was 🀑; ere 🀑 saw :frog_face:! Rejoice! 🚬 for $9?
  • Words (tokens): ['able', 'was', '🀑', '.', 'ere', '🀑', 'saw', '.', 'rejoice', '.', '🚬', 'for', '.']

Check with an alternative sentence.

source = emoji.emojize(
    ("I'm tired of being a token! Where's all the other "
     ":cheese_wedge:-sniffing"
     " Gnomish at? I bet theres' at least 2 of us :gorilla: "
     "out there, or maybe more..."),
    use_aliases=True)
print(f" - Before: {source}")
print(f" - After: {tokenize(source)}")
  • Before: I'm tired of being a token! Where's all the other πŸ§€-sniffing Gnomish at? I bet theres' at least 2 of us 🦍 out there, or maybe more…
  • After: ['i', 'tired', 'of', 'being', 'a', 'token', '.', 'where', 'all', 'the', 'other', 'πŸ§€.sniffing', 'gnomish', 'at', '.', 'i', 'bet', 'theres', 'at', 'least', 'of', 'us', '🦍', 'out', 'there', '.', 'or', 'maybe', 'more']

Interestingly, it removes the 'm (from the contraction "I'm") but not "a": word_tokenize splits "I'm" into "I" and "'m", and our filter then drops "'m" because the apostrophe means isalpha is False, the same way it drops the "'s" from "Where's".

Sliding Window of Words

The idea behind word-embeddings is the assumption that the words around a word (its context) are what give the word its meaning, so we create vectors whose distance to the vectors of words with similar contexts is smaller than their distance to the vectors of words with different contexts. Our data is therefore made up of lists of the words surrounding each center word. If, for example, our sentence is:

Fruit flies like a banana.

and we take the context of the word "like" with a half-window (the number of tokens on either side of the center word) of 2, then our window will be:

["fruit", "flies", "a", "banana"]
GetWindowYield = Tuple[List[str], str]

def get_windows(words: List[str],
                half_window: int) -> Generator[GetWindowYield, None, None]:
    """Generates windows of words

    Args:
     words: cleaned tokens
     half_window: number of words in the half-window

    Yields:
     the next window
    """
    for center_index in range(half_window, len(words) - half_window):
        center_word = words[center_index]
        context_words = (words[(center_index - half_window) : center_index]
                         + words[(center_index + 1):(center_index + half_window + 1)])
        yield context_words, center_word
    return

The first argument of this function, words, is a list of words (or tokens). The second argument, half_window, is the context half-size. As I mentioned, for a given center word, the context words are made of half_window words to the left and half_window words to the right of the center word.

Now let's try it on the words we defined earlier using a window of 2.

for context, word in get_windows(words, 2):
    print(f" - {context}\t{word}")
  • ['able', 'was', '.', 'ere'] 🀑
  • ['was', '🀑', 'ere', '🀑'] .
  • ['🀑', '.', '🀑', 'saw'] ere
  • ['.', 'ere', 'saw', '.'] 🀑
  • ['ere', '🀑', '.', 'rejoice'] saw
  • ['🀑', 'saw', 'rejoice', '.'] .
  • ['saw', '.', '.', '🚬'] rejoice
  • ['.', 'rejoice', '🚬', 'for'] .
  • ['rejoice', '.', 'for', '.'] 🚬

The first example is made up of:

  • the context words "able", "was", ".", "ere",
  • and the center word to be predicted is a clown-face.

Once more with feeling.

for context, word in get_windows(tokenize("My baloney has a first name, it's Gerald."), 2):
    print(f" - {context}\t{word}")
  • ['my', 'baloney', 'a', 'first'] has
  • ['baloney', 'has', 'first', 'name'] a
  • ['has', 'a', 'name', '.'] first
  • ['a', 'first', '.', 'it'] name
  • ['first', 'name', 'it', 'gerald'] .
  • ['name', '.', 'gerald', '.'] it

It's a little more obvious now that, the way we wrote it, tokens within half_window of either end never become center words (here "my" and "baloney" at the start and "gerald" and "." at the end get no context), so if we wanted them to, we'd probably have to pad the tokens or come up with some other scheme.
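
One way to do that padding (not something the original code does) is to wrap get_windows with a version that pads both ends with a placeholder token; note that the placeholder (here a hypothetical "<pad>") would also need an entry in word_to_index before the one-hot step below.

def get_padded_windows(words: List[str],
                       half_window: int,
                       pad: str="<pad>") -> Generator[GetWindowYield, None, None]:
    """Like get_windows, but pads so that every token becomes a center word

    Args:
     words: cleaned tokens
     half_window: number of words in the half-window
     pad: placeholder token used to pad both ends

    Returns:
     generator of (context words, center word) pairs
    """
    padded = [pad] * half_window + list(words) + [pad] * half_window
    return get_windows(padded, half_window)

for context, word in get_padded_windows(tokenize("My baloney has a first name, it's Gerald."), 2):
    print(f" - {context}\t{word}")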

Words To Vectors

The next step is to convert the words to vectors using the contexts and words.

Mapping words to indices and indices to words

The center words will be represented as one-hot vectors (vectors of all zeros except in the cell representing the word), and the vectors that represent context words are also based on one-hot vectors.

To create one-hot word vectors, we can start by mapping each unique word to a unique integer (or index). We'll start with a function named get_dict that creates a pair of Python dictionaries mapping words to integers and back.

WordToIndex = Dict[str, int]
IndexToWord = Dict[int, str]
GetDictOutput = Tuple[WordToIndex, IndexToWord]

def get_dict(data: List[str]) -> GetDictOutput:
    """Creates index to word mappings

    The index is based on the sorted unique tokens in the data

    Args:
     data: the data you want to pull from

    Returns:
     word_to_index: returns dictionary mapping the word to its index
     index_to_word: returns dictionary mapping the index to its word
    """
    words = sorted(list(set(data)))

    word_to_index = {word: index for index, word in enumerate(words)}
    index_to_word = {index: word for index, word in enumerate(words)}
    return word_to_index, index_to_word

So, let's try it out with the corpus.

word_to_index, index_to_word = get_dict(words)
print(f" - {word_to_index}")
print(f" - {index_to_word}")
- {'.': 0, 'able': 1, 'ere': 2, 'for': 3, 'rejoice': 4, 'saw': 5, 'was': 6, '🚬': 7, '🀑': 8}
- {0: '.', 1: 'able', 2: 'ere', 3: 'for', 4: 'rejoice', 5: 'saw', 6: 'was', 7: '🚬', 8: '🀑'}

If it isn't obvious, the purpose of the word_to_index dictionary is to convert a word to an integer.

token = "ere"
print(f"Index of the word '{token}':  {word_to_index[token]}")
Index of the word 'ere':  2

And now in the other direction.

print(f"Word which has index 2:  '{index_to_word[2]}'")
Word which has index 2:  'ere'

Finally, we need to know how many unique tokens are in our data set. The unique tokens make up our "vocabulary".

vocabulary_size = len(word_to_index)
print(f"Size of vocabulary: {vocabulary_size}")
Size of vocabulary: 9

One-Hot Word Vectors

Now let's look at creating one-hot vectors for the words. We'll start with one word - "rejoice".

word = "rejoice"
word_index = word_to_index[word]
print(f"Index for '{word}': {word_index}")
Index for 'rejoice': 4

Now we'll create a vector that has as many cells as there are tokens in the vocabulary and populate it with zeros (using numpy.zeros). This is why we needed the vocabulary size.

center_word_vector = numpy.zeros(vocabulary_size)

print(center_word_vector)
assert len(center_word_vector) == vocabulary_size
assert center_word_vector.sum() == 0.0
[0. 0. 0. 0. 0. 0. 0. 0. 0.]

Now, to make the vector represent our word, we need to set the cell that represents the word to 1.

center_word_vector[word_index] = 1

And now we have our one-hot word vector.

print(center_word_vector)

the_ones = numpy.where(center_word_vector == 1)[0]
for item in the_ones:
    print(f"{index_to_word[int(item)]}")
[0. 0. 0. 0. 1. 0. 0. 0. 0.]
rejoice

So, like before, let's put everything into a function.

def word_to_one_hot_vector(word: str,
                           word_to_index: WordToIndex=word_to_index,
                           vocabulary_size: int=vocabulary_size) -> numpy.ndarray:
    """Create a one-hot-vector with a 1 where the word is


    Args:
     word: known token to add to the vector
     word_to_index: dict mapping word: index
     vocabulary_size: how long to make the vector

    Returns:
     vector with zeros everywhere except where the word is
    """
    one_hot_vector = numpy.zeros(vocabulary_size)
    one_hot_vector[word_to_index[word]] = 1
    return one_hot_vector

Now we can check that it worked out.

actual = word_to_one_hot_vector(word)
print(actual)
assert all(actual == center_word_vector)
[0. 0. 0. 0. 1. 0. 0. 0. 0.]

Context Word Vectors

So, now we come to the context words. It may not be quite as obvious what this is, since we said we're going to use one-hot vectors, but each context is made up of multiple words. What we'll do is calculate the average of the one-hot vectors representing the individual words.

As an illustration let's start with one set of context words.

contexts = get_windows(words, 2)

context_words, word = next(contexts)
print(f" - Word: {word}")
print(f" - Context: {context_words}")
  • Word: 🀑
  • Context: ['able', 'was', '.', 'ere']

To build the context representation we're going to start with a list of the one-hot vectors for each of the words in the context.

context_words_vectors = [word_to_one_hot_vector(word)
                         for word in context_words]
pprint(context_words_vectors)
[array([0., 1., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 1., 0., 0.]),
 array([1., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 1., 0., 0., 0., 0., 0., 0.])]

And now we can get the average of these vectors using numpy's mean function, to get the vector representation of the context words.

ROWS, COLUMNS = 0, 1
first = numpy.mean(context_words_vectors, axis=ROWS)
print(first)
[0.25 0.25 0.25 0.   0.   0.   0.25 0.   0.  ]

Once again, let's wrap those separate code blocks back into a single function.

def context_words_to_vector(context_words: List[str],
                            word_to_index: WordToIndex=word_to_index) -> numpy.ndarray:
    """Create vector with the mean of the one-hot-vectors

    Args:
     context_words: words to convert to one-hot vectors
     word_to_index: dict mapping word to index

    Returns:
     array with the mean of the one-hot vectors for the context_words
    """
    vocabulary_size = len(word_to_index)
    context_words_vectors = [
        word_to_one_hot_vector(word, word_to_index, vocabulary_size)
        for word in context_words]
    return numpy.mean(context_words_vectors, axis=ROWS)

second = context_words_to_vector(context_words)
print(second)
assert all(first==second)
[0.25 0.25 0.25 0.   0.   0.   0.25 0.   0.  ]

So, there you go. It isn't really a one-hot vector but is just based on one.

Building the training set

Now we can put them all together and create a training data set for a Continuous Bag of Words model.

print(words)
['able', 'was', '🀑', '.', 'ere', '🀑', 'saw', '.', 'rejoice', '.', '🚬', 'for', '.']

for context_words, center_word in get_windows(words, half_window=2):
    print(f'Context words:  {context_words} -> {context_words_to_vector(context_words)}')
    print(f"Center word:  {center_word} -> "
          f"{word_to_one_hot_vector(center_word)}")
    print()
Context words:  ['able', 'was', '.', 'ere'] -> [0.25 0.25 0.25 0.   0.   0.   0.25 0.   0.  ]
Center word:  🀑 -> [0. 0. 0. 0. 0. 0. 0. 0. 1.]

Context words:  ['was', '🀑', 'ere', '🀑'] -> [0.   0.   0.25 0.   0.   0.   0.25 0.   0.5 ]
Center word:  . -> [1. 0. 0. 0. 0. 0. 0. 0. 0.]

Context words:  ['🀑', '.', '🀑', 'saw'] -> [0.25 0.   0.   0.   0.   0.25 0.   0.   0.5 ]
Center word:  ere -> [0. 0. 1. 0. 0. 0. 0. 0. 0.]

Context words:  ['.', 'ere', 'saw', '.'] -> [0.5  0.   0.25 0.   0.   0.25 0.   0.   0.  ]
Center word:  🀑 -> [0. 0. 0. 0. 0. 0. 0. 0. 1.]

Context words:  ['ere', '🀑', '.', 'rejoice'] -> [0.25 0.   0.25 0.   0.25 0.   0.   0.   0.25]
Center word:  saw -> [0. 0. 0. 0. 0. 1. 0. 0. 0.]

Context words:  ['🀑', 'saw', 'rejoice', '.'] -> [0.25 0.   0.   0.   0.25 0.25 0.   0.   0.25]
Center word:  . -> [1. 0. 0. 0. 0. 0. 0. 0. 0.]

Context words:  ['saw', '.', '.', '🚬'] -> [0.5  0.   0.   0.   0.   0.25 0.   0.25 0.  ]
Center word:  rejoice -> [0. 0. 0. 0. 1. 0. 0. 0. 0.]

Context words:  ['.', 'rejoice', '🚬', 'for'] -> [0.25 0.   0.   0.25 0.25 0.   0.   0.25 0.  ]
Center word:  . -> [1. 0. 0. 0. 0. 0. 0. 0. 0.]

Context words:  ['rejoice', '.', 'for', '.'] -> [0.5  0.   0.   0.25 0.25 0.   0.   0.   0.  ]
Center word:  🚬 -> [0. 0. 0. 0. 0. 0. 0. 1. 0.]

Next we'll create a generator that yields the context vectors along with their matching center-word one-hot vectors.

def get_training_example(
        words: List[str], half_window: int=2,
        word_to_index: WordToIndex=word_to_index) -> Generator[numpy.ndarray,
                                                               None, None]:
    """generates training examples

    Args:
     words: source of words
     half_window: half the window size
     word_to_index: dict with word to index mapping

    Yields:
     array with the mean of the one-hot vectors for the context words
    """
    vocabulary_size = len(word_to_index)
    for context_words, center_word in get_windows(words, half_window):
        yield context_words_to_vector(context_words), word_to_one_hot_vector(
            center_word, word_to_index,
            vocabulary_size)
    return

The output of this function can be iterated on to get successive context word vectors and center word vectors, as demonstrated in the next cell.

for context_words_vector, center_word_vector in get_training_example(words):
    print(f'Context words vector:  {context_words_vector}')
    print(f'Center word vector:  {center_word_vector}')
    print()
Context words vector:  [0.25 0.25 0.25 0.   0.   0.   0.25 0.   0.  ]
Center word vector:  [0. 0. 0. 0. 0. 0. 0. 0. 1.]

Context words vector:  [0.   0.   0.25 0.   0.   0.   0.25 0.   0.5 ]
Center word vector:  [1. 0. 0. 0. 0. 0. 0. 0. 0.]

Context words vector:  [0.25 0.   0.   0.   0.   0.25 0.   0.   0.5 ]
Center word vector:  [0. 0. 1. 0. 0. 0. 0. 0. 0.]

Context words vector:  [0.5  0.   0.25 0.   0.   0.25 0.   0.   0.  ]
Center word vector:  [0. 0. 0. 0. 0. 0. 0. 0. 1.]

Context words vector:  [0.25 0.   0.25 0.   0.25 0.   0.   0.   0.25]
Center word vector:  [0. 0. 0. 0. 0. 1. 0. 0. 0.]

Context words vector:  [0.25 0.   0.   0.   0.25 0.25 0.   0.   0.25]
Center word vector:  [1. 0. 0. 0. 0. 0. 0. 0. 0.]

Context words vector:  [0.5  0.   0.   0.   0.   0.25 0.   0.25 0.  ]
Center word vector:  [0. 0. 0. 0. 1. 0. 0. 0. 0.]

Context words vector:  [0.25 0.   0.   0.25 0.25 0.   0.   0.25 0.  ]
Center word vector:  [1. 0. 0. 0. 0. 0. 0. 0. 0.]

Context words vector:  [0.5  0.   0.   0.25 0.25 0.   0.   0.   0.  ]
Center word vector:  [0. 0. 0. 0. 0. 0. 0. 1. 0.]
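
If you later want to train on batches, it can be convenient to stack these examples into matrices. Here's a minimal sketch, where putting one example per column (giving V rows) is an assumption chosen to match the column-vector convention used above.

x_columns, y_columns = zip(*get_training_example(words))
X = numpy.column_stack(x_columns)  # each column is a context-words vector
Y = numpy.column_stack(y_columns)  # each column is a one-hot center-word vector
print(X.shape, Y.shape)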

End

Now that we know how to create the training set, we can move on to the CBOW model itself, which will be covered in the next post. This is part of a series of posts looking at some preliminaries for creating word-embeddings. There is a table-of-contents post here.