Trax GRU Model

Creating a GRU Model Using Trax


# from pypi
from trax import layers
import trax


Trax Review

Trax allows us to define neural network architectures by stacking layers (similarly to other libraries such as Keras). For this, the Serial combinator is often used: it stacks layers serially using function composition.
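To make "stacking by function composition" concrete, here is a minimal plain-Python sketch. It does not use Trax; the `serial` helper and the two toy "layers" are hypothetical names made up for illustration only.

```python
from functools import reduce
from typing import Callable


def serial(*layers: Callable) -> Callable:
    """Compose callables left to right (a toy stand-in for a Serial combinator)."""
    def composed(x):
        # feed the output of each layer into the next one
        return reduce(lambda value, layer: layer(value), layers, x)
    return composed


def double(x):
    return 2 * x


def add_one(x):
    return x + 1


model = serial(double, add_one)  # computes add_one(double(x))
print(model(3))  # prints 7
```

The order of the arguments is the order in which the data flows, which is what makes a serial definition easy to read.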

Next we'll look at a simple vanilla NN architecture containing one hidden (dense) layer with 128 cells and an output (dense) layer with 10 cells, to which we apply a final LogSoftmax layer.

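simple = layers.Serial(
    layers.Dense(128),
    layers.Relu(),  # assumed activation: the original listing is cut off here
    layers.Dense(10),
    layers.LogSoftmax()
)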

Each of the layers within the Serial combinator layer is considered a sublayer. Notice that unlike similar libraries, in Trax the activation functions are considered layers. To know more about the Serial layer check out the documentation for it.

Here's the representation for it.


Printing the model gives you the exact same information as the model's definition itself.

By just looking at the definition you can clearly see what is going on inside the neural network. Trax is very straightforward in the way a network is defined.

The GRU Model

To create a GRU model you will need to be familiar with the following layers:

  • ShiftRight: Shifts the tensor to the right by padding on axis 1. The mode should be specified and refers to the context in which the model is being used. Possible values are 'train', 'eval' or 'predict' (predict mode is for fast inference). Defaults to 'train'.
  • Embedding: Maps discrete tokens to vectors. It will have shape (vocabulary length X dimension of output vectors). The dimension of output vectors (also called d_feature) is the number of elements in the word embedding.
  • GRU: The GRU layer. It leverages another Trax layer called GRUCell. The number of GRU units should be specified and should match the number of elements in the word embedding. If you want to stack two consecutive GRU layers, you can do it with a Python list comprehension.
  • Dense: Vanilla Dense layer.
  • LogSoftmax: Log Softmax function.
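As a rough illustration of what "shifting right by padding on axis 1" means, here is a numpy sketch. It is not Trax's implementation; the `shift_right` function and its `n_positions` parameter are hypothetical and only mimic the idea described above.

```python
import numpy


def shift_right(x: numpy.ndarray, n_positions: int = 1) -> numpy.ndarray:
    """Pad zeros at the start of axis 1 and drop the same number from the end."""
    pad_widths = [(0, 0)] * x.ndim
    pad_widths[1] = (n_positions, 0)
    padded = numpy.pad(x, pad_widths, mode="constant", constant_values=0)
    return padded[:, :-n_positions]


tokens = numpy.array([[5, 6, 7, 8],
                      [9, 10, 11, 12]])
shifted = shift_right(tokens)
print(shifted)
# [[ 0  5  6  7]
#  [ 0  9 10 11]]
```

With this shift, the input at position t is the token from position t - 1, so a language model never sees the token it is supposed to predict.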

Putting everything together the GRU model looks like this.

mode = 'train'
vocab_size = 256
model_dimension = 512
n_layers = 2

GRU = layers.Serial(
      layers.ShiftRight(mode=mode),
      layers.Embedding(vocab_size=vocab_size, d_feature=model_dimension),
      [layers.GRU(n_units=model_dimension) for _ in range(n_layers)],
      layers.Dense(n_units=vocab_size),
      layers.LogSoftmax())

Next is a helper function that prints information for every layer (sublayer within Serial).

Try changing the parameters defined before the GRU model and see how it changes.

def show_layers(model, layer_prefix="Serial.sublayers"):
    """Prints each sublayer of a Serial model."""
    print(f"Total layers: {len(model.sublayers)}\n")
    for index, sublayer in enumerate(model.sublayers):
        print(f"{layer_prefix}_{index}: {sublayer}\n")

show_layers(GRU)
Total layers: 6

Serial.sublayers_0: Serial[

Serial.sublayers_1: Embedding_256_512

Serial.sublayers_2: GRU_512

Serial.sublayers_3: GRU_512

Serial.sublayers_4: Dense_256

Serial.sublayers_5: LogSoftmax

Interesting that it inserted a second Serial for the ShiftRight…

Vanilla RNNs and GRUs

Vanilla RNNs, GRUs and the scan function


# from python
from argparse import Namespace
from collections import namedtuple
from time import perf_counter

# from pypi
from expects import be_true, expect
from numpy import random

import numpy

Set Up

The Sigmoid Function

def sigmoid(x: numpy.ndarray) -> numpy.ndarray:
    """Calculates the sigmoid of x

    Args:
     x: the array (or float) to get the sigmoid for

    Returns:
     the sigmoid of x
    """
    return 1.0 / (1.0 + numpy.exp(-x))


These are going to hold the arrays that we are using for calculation.

Weights = namedtuple("Weights", "w1 w2 w3 b1 b2 b3".split())
Inputs = namedtuple("Inputs", "X hidden_state".split())


The Forward Method For Vanilla RNNs and GRUs

In this part of the notebook, we'll look at the implementation of the forward method for a vanilla RNN and implement that same method for a GRU. For this exercise we'll use a set of random weights and variables with the following dimensions:

  • Embedding size (emb) : 128
  • Hidden state size (h_dim) : 16

The weights w_ and biases b_ are initialized with dimensions (h_dim, emb + h_dim) and (h_dim, 1), respectively. We expect the hidden state h_t to be a column vector with size (h_dim, 1) and the initial hidden state h_0 is a vector of zeros.
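A quick shape check shows why these dimensions fit together. This is a sketch with made-up variable names (`emb`, `h_dim`, `stacked`) and a seeded generator, not the arrays used later in the post.

```python
import numpy

emb, h_dim = 128, 16
rng = numpy.random.default_rng(0)

w = rng.standard_normal((h_dim, emb + h_dim))  # one weight matrix
b = rng.standard_normal((h_dim, 1))            # one bias vector
h_0 = numpy.zeros((h_dim, 1))                  # initial hidden state
x = rng.standard_normal((emb, 1))              # one embedded input

# stacking the hidden state on top of the input gives an (emb + h_dim, 1) column
stacked = numpy.concatenate([h_0, x])
h_1 = w @ stacked + b

print(stacked.shape)  # (144, 1)
print(h_1.shape)      # (16, 1)
```

The product of a (16, 144) matrix and a (144, 1) column is a (16, 1) column, so the new hidden state has the same shape as the old one and can be fed back in at the next time step.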

Now we'll set up the variables for the dimensions.

Dimension = Namespace(
    embedding=128,
    hidden_variables=256,  # number of time steps (T)
    hidden_state=16,
)

Now we'll initialize the various arrays.


weights = Weights(
    w1 = random.standard_normal(
        (Dimension.hidden_state, Dimension.embedding + Dimension.hidden_state)),
    w2 = random.standard_normal(
        (Dimension.hidden_state, Dimension.embedding + Dimension.hidden_state)),
    w3 = random.standard_normal(
        (Dimension.hidden_state, Dimension.embedding + Dimension.hidden_state)),
    b1 = random.standard_normal((Dimension.hidden_state, 1)),
    b2 = random.standard_normal((Dimension.hidden_state, 1)),
    b3 = random.standard_normal((Dimension.hidden_state, 1)),
)

inputs = Inputs(
    hidden_state = numpy.zeros((Dimension.hidden_state, 1)),
    X = random.standard_normal((Dimension.hidden_variables, Dimension.embedding, 1))
)

The Forward Method For Vanilla RNNs

The vanilla RNN cell is quite straightforward.

The computations made in a vanilla RNN cell are equivalent to the following equations:

\begin{equation} h^{\langle t \rangle}=g(W_{h}[h^{\langle t-1 \rangle},x^{\langle t \rangle}] + b_h) \label{eq: htRNN} \end{equation} \begin{equation} \hat{y}^{\langle t \rangle}=g(W_{yh}h^{\langle t \rangle} + b_y) \label{eq: ytRNN} \end{equation}

Where \([h^{\langle t-1 \rangle},x^{\langle t \rangle}]\) means that \(h^{\langle t-1 \rangle}\) and \(x^{\langle t \rangle}\) are concatenated together.

Here's the implementation of the forward method for a vanilla RNN.

def forward_vanilla_RNN(inputs: tuple, weights: tuple) -> tuple:
    """Forward propagation for a single vanilla RNN cell

    Args:
     inputs: collection of x and the hidden state
     weights: collection of weights and biases

    Returns:
     hidden state twice (so we don't have to implement y for the scan)
    """
    x, hidden_state = inputs
    w1, _, _, b1, _, _ = weights
    h_t = numpy.dot(w1, numpy.concatenate([hidden_state, x])) + b1
    h_t = sigmoid(h_t)
    return h_t, h_t

As you can see, we omitted the computation of \(\hat{y}^{\langle t \rangle}\). This was done for the sake of simplicity, so you can focus on the way that hidden states are updated here and in the GRU cell.

The Forward Method For GRUs

A GRU cell does more computation than a vanilla RNN cell.

GRUs have relevance \(\Gamma_r\) and update \(\Gamma_u\) gates that control how the hidden state \(h^{\langle t \rangle}\) is updated on every time step. With these gates, GRUs are capable of keeping relevant information in the hidden state even for long sequences. The equations needed for the forward method in GRUs are:

\begin{equation} \Gamma_r=\sigma{(W_r[h^{\langle t-1\rangle}, x^{\langle t\rangle}]+b_r)} \end{equation} \begin{equation} \Gamma_u=\sigma{(W_u[h^{\langle t-1\rangle}, x^{\langle t\rangle}]+b_u)} \end{equation} \begin{equation} c^{\langle t\rangle}=\tanh{(W_h[\Gamma_r*h^{\langle t-1\rangle},x^{\langle t\rangle}]+b_h)} \end{equation} \begin{equation} h^{\langle t\rangle}=\Gamma_u*c^{\langle t\rangle}+(1-\Gamma_u)*h^{\langle t-1\rangle} \end{equation}
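To see how the update gate decides between keeping the old hidden state and replacing it, here is a small sketch of the four equations with hand-picked values. The `gru_step` function and the tiny dimensions are assumptions chosen for illustration: the weights are all zero so that the biases alone set the gates.

```python
import numpy


def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))


def gru_step(x, h_prev, w_r, w_u, w_h, b_r, b_u, b_h):
    """One GRU step following the four equations above."""
    stacked = numpy.concatenate([h_prev, x])
    r = sigmoid(w_r @ stacked + b_r)                               # relevance gate
    u = sigmoid(w_u @ stacked + b_u)                               # update gate
    c = numpy.tanh(w_h @ numpy.concatenate([r * h_prev, x]) + b_h)  # candidate
    return u * c + (1 - u) * h_prev


h_dim, emb = 2, 3
zero_w = numpy.zeros((h_dim, h_dim + emb))
zero_b = numpy.zeros((h_dim, 1))
h_prev = numpy.array([[0.5], [-0.5]])
x = numpy.ones((emb, 1))

# a very negative update bias drives the update gate to ~0: the old state is kept
closed = gru_step(x, h_prev, zero_w, zero_w, zero_w, zero_b, zero_b - 50.0, zero_b + 1.0)

# a very positive update bias drives the update gate to ~1: the candidate replaces the state
opened = gru_step(x, h_prev, zero_w, zero_w, zero_w, zero_b, zero_b + 50.0, zero_b + 1.0)

print(closed.ravel())  # approximately [ 0.5 -0.5]
print(opened.ravel())  # approximately [0.7616 0.7616], which is tanh(1)
```

In a trained network the gates take values between these two extremes, which is how the cell can carry information across many time steps.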

In the next cell, we'll implement the forward method for a GRU cell by computing the update u and relevance r gates, and the candidate hidden state c.

def forward_GRU(inputs: tuple, weights: tuple) -> tuple:
    """Forward propagation for a single GRU cell

    Args:
     inputs: collection of (x, h_t)
     weights: tuple of weights and biases

    Returns:
     updated hidden state twice
    """
    x, h_t = inputs

    # weights and biases for the update gate, relevance gate, and candidate
    wu, wr, wc, bu, br, bc = weights

    # Update gate
    u = numpy.dot(wu, numpy.concatenate([h_t, x])) + bu
    u = sigmoid(u)

    # Relevance gate
    r = numpy.dot(wr, numpy.concatenate([h_t, x])) + br
    r = sigmoid(r)

    # Candidate hidden state
    c = numpy.dot(wc, numpy.concatenate([r * h_t, x])) + bc
    c = numpy.tanh(c)

    # New Hidden state h_t
    h_t = u * c + (1 - u) * h_t
    return h_t, h_t
  • A Check
    actual = forward_GRU([inputs.X[1], inputs.hidden_state], weights)[0]
    expected = numpy.array([[ 9.77779014e-01],
                            [ 2.10804828e-02],
                            [ 9.77365398e-05],
                            [ 9.99833090e-01],
                            [ 1.63200940e-08],
                            [ 8.51874303e-01],
                            [ 5.21399924e-02],
                            [ 2.15495959e-02],
                            [ 9.99878828e-01],
                            [ 9.77165472e-01]])
    expect(numpy.allclose(actual, expected)).to(be_true)
    [[ 9.77779014e-01]
     [ 2.10804828e-02]
     [ 9.77365398e-05]
     [ 9.99833090e-01]
     [ 1.63200940e-08]
     [ 8.51874303e-01]
     [ 5.21399924e-02]
     [ 2.15495959e-02]
     [ 9.99878828e-01]
     [ 9.77165472e-01]]

Implementation of the scan function

The scan function is used for forward propagation in RNNs. It takes as inputs:

  • fn : the function to be called recurrently (i.e. forward_GRU)
  • elems : the list of inputs for each time step (X)
  • weights : the parameters needed to compute fn
  • h_0 : the initial hidden state

scan goes through all the elements x in elems, calls the function fn with arguments ([x, h_t], weights), stores the computed hidden state h_t, and appends the result to a list ys.

def scan(fn, elems, weights, h_0=None) -> tuple:
    """Forward propagation for RNNs

    Args:
     fn: callable that updates the hidden state
     elems: input (x)
     weights: collection of weights
     h_0: the initial hidden state

    Returns:
     list of outputs for each time step, final hidden state
    """
    h_t = h_0
    ys = []
    for x in elems:
        y, h_t = fn([x, h_t], weights)
        ys.append(y)
    return ys, h_t
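The scan pattern is not specific to RNNs. As a sanity check, here is the same loop driving a toy running sum instead of a recurrent cell. The `running_sum` function is hypothetical and only there to make the data flow easy to follow: the "hidden state" is the total so far and the "weights" are a single scale factor.

```python
def scan(fn, elems, weights, h_0=None):
    """Apply fn step by step, carrying the state forward and collecting outputs."""
    h_t = h_0
    ys = []
    for x in elems:
        y, h_t = fn([x, h_t], weights)
        ys.append(y)
    return ys, h_t


def running_sum(inputs, weights):
    """Toy 'cell': adds the scaled input to the carried total."""
    x, total = inputs
    total = total + weights * x
    return total, total


ys, final = scan(running_sum, [1, 2, 3, 4], weights=10, h_0=0)
print(ys)     # [10, 30, 60, 100]
print(final)  # 100
```

Each output depends on everything that came before it, and the last carried state equals the last output because the toy cell, like the two forward methods above, returns its state twice.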

Comparing Vanilla RNNs and GRUs

You have already seen how forward propagation is computed for vanilla RNNs and GRUs. As a quick recap, you need a forward method for the recurrent cell and a function like scan to go through all the elements of a sequence using that forward method. You saw that GRUs perform more computations than vanilla RNNs, and you can check that they have three times as many parameters. In the next two cells, we compute forward propagation for a sequence with 256 time steps (T) for an RNN and a GRU with the same hidden state h_t size (h_dim = 16).
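The parameter claim can be checked by counting. The helper below is a hypothetical sketch that counts only the hidden-state parameters used in this post (each weight matrix has shape (h_dim, emb + h_dim) and each bias has shape (h_dim, 1)); the output layer for y is left out, as it is in the forward methods.

```python
def cell_parameters(emb: int, h_dim: int, n_weight_matrices: int) -> int:
    """Count weights and biases for a cell with the given number of matrices."""
    per_matrix = h_dim * (emb + h_dim) + h_dim  # one weight matrix plus its bias
    return n_weight_matrices * per_matrix


rnn_parameters = cell_parameters(emb=128, h_dim=16, n_weight_matrices=1)
gru_parameters = cell_parameters(emb=128, h_dim=16, n_weight_matrices=3)

print(rnn_parameters)                   # 2320
print(gru_parameters)                   # 6960
print(gru_parameters / rnn_parameters)  # 3.0
```

The vanilla cell uses one weight matrix and one bias, while the GRU uses one pair each for the update gate, the relevance gate, and the candidate state.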

Vanilla RNNs

We'll run the forward pass for the vanilla RNN and time it.

tick = perf_counter()
ys, h_T = scan(forward_vanilla_RNN, inputs.X, weights, inputs.hidden_state)
tock = perf_counter()
RNN_time=(tock-tick) * 1000
print (f"It took {RNN_time:.2f}ms to run the forward method for the vanilla RNN.")
It took 2.03ms to run the forward method for the vanilla RNN.


GRUs

tick = perf_counter()
ys, h_T = scan(forward_GRU, inputs.X, weights, inputs.hidden_state)
tock = perf_counter()
GRU_time=(tock - tick) * 1000
print (f"It took {GRU_time:.2f}ms to run the forward method for the GRU.")
It took 5.48ms to run the forward method for the GRU.

GRUs take more time to compute. This means that training and prediction would take more time for a GRU than for a vanilla RNN. However, GRUs allow you to propagate relevant information even for long sequences, so when selecting an architecture for NLP we should assess the tradeoff between computational time and performance.

Jax, Numpy, and Perplexity



Note to future self: the default jax installation from pip is CPU-only. To get it to run on the GPU (which seems to be the main reason to use it) you need to ask for the CUDA build explicitly. Right now the command is:

pip install jaxlib==0.1.57+cuda111 -f

Where cuda111 refers to the fact that I have cuda 11.1 installed on the server, so I need that version. See the installation instructions for more information (and to see if anything changes).

# from python
from argparse import Namespace
from pathlib import Path

import os

# from pypi
from dotenv import load_dotenv
from trax import layers

import numpy
import trax
import trax.fastmath.numpy as trax_numpy

Set Up

The Data Paths

load_dotenv("posts/nlp/.env", override=True)
Paths = Namespace(

The Random Seed

SEED = 32

# trax no longer has a global seed setting - pass it to the training.Loop
# trax.supervised.trainer_lib.init_random_number_generators(SEED)


Numpy vs Trax

One important change to take into consideration is that the types of the resulting objects will be different depending on which numpy you use. With regular numpy you get numpy.ndarray, but with Trax's numpy you will get jax.interpreters.xla.DeviceArray. These two types map to each other, so if you find some error logs mentioning the DeviceArray type, don't worry about it: treat it like you would treat an ndarray and march ahead.

You can get a randomized numpy array by using the numpy.random.random() function.

This is one of the functionalities that Trax's numpy does not currently support in the same way as the regular numpy.

numpy_array = numpy.random.random((5,10))
print(f"The regular numpy array looks like this:\n\n {numpy_array}\n")
print(f"It is of type: {type(numpy_array)}")
The regular numpy array looks like this:

 [[0.85888927 0.37271115 0.55512878 0.95565655 0.7366696  0.81620514
  0.10108656 0.92848807 0.60910917 0.59655344]
 [0.09178413 0.34518624 0.66275252 0.44171349 0.55148779 0.70371249
  0.58940123 0.04993276 0.56179184 0.76635847]
 [0.91090833 0.09290995 0.90252139 0.46096041 0.45201847 0.99942549
  0.16242374 0.70937058 0.16062408 0.81077677]
 [0.03514717 0.53488673 0.16650012 0.30841038 0.04506241 0.23857613
  0.67483453 0.78238275 0.69520163 0.32895445]
 [0.49403187 0.52412136 0.29854125 0.46310814 0.98478429 0.50113492
  0.39807245 0.72790532 0.86333097 0.02616954]]

It is of type: <class 'numpy.ndarray'>

You can easily cast regular numpy arrays or lists into trax numpy arrays using the trax.fastmath.numpy.array() function:

trax_numpy_array = trax_numpy.array(numpy_array)
print(f"The trax numpy array looks like this:\n\n {trax_numpy_array}\n")
print(f"It is of type: {type(trax_numpy_array)}")
The trax numpy array looks like this:

 [[0.8588893  0.37271115 0.55512875 0.9556565  0.7366696  0.81620514
  0.10108656 0.9284881  0.60910916 0.59655344]
 [0.09178413 0.34518623 0.6627525  0.44171348 0.5514878  0.70371246
  0.58940125 0.04993276 0.56179184 0.7663585 ]
 [0.91090834 0.09290995 0.9025214  0.46096042 0.45201847 0.9994255
  0.16242374 0.7093706  0.16062407 0.81077677]
 [0.03514718 0.5348867  0.16650012 0.30841038 0.04506241 0.23857613
  0.67483455 0.7823827  0.69520164 0.32895446]
 [0.49403188 0.52412134 0.29854125 0.46310815 0.9847843  0.50113493
  0.39807245 0.72790533 0.86333096 0.02616954]]

It is of type: <class 'jax.interpreters.xla._DeviceArray'>

The previous section was a quick look at Trax's numpy. However this notebook also aims to teach you how you can calculate the perplexity of a trained model.

Calculating Perplexity

The perplexity is a metric that measures how well a probability model predicts a sample and it is commonly used to evaluate language models. It is defined as:

\[ P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}} \]

In practice you would usually take the log of that formula. This lets us use the log probabilities that our RNN outputs directly, and it turns exponents into products and products into sums, which makes the computation simpler and more efficient. You should also take care of the padding: it should not be included when calculating the perplexity, because it would make the measure look artificially good. The algebra behind this process is explained next:

\begin{align} \log P(W) &= \log\left(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\right) \\ &= \log\left(\left(\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}\right)^{\frac{1}{N}}\right) \\ &= \log\left(\left(\prod_{i=1}^{N} P(w_i| w_1,...,w_{i-1})\right)^{-\frac{1}{N}}\right) \\ &= -\frac{1}{N}\log\left(\prod_{i=1}^{N} P(w_i| w_1,...,w_{i-1})\right) \\ &= -\frac{1}{N}\sum_{i=1}^{N} \log P(w_i| w_1,...,w_{i-1}) \end{align}
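The derivation can be checked numerically. This sketch uses three made-up token probabilities (not model output) and compares the direct definition with the log-space version.

```python
import numpy

# hypothetical probabilities the model assigned to three observed tokens
probabilities = numpy.array([0.5, 0.25, 0.125])
n = len(probabilities)

# direct definition: N-th root of the product of inverse probabilities
direct = numpy.prod(1.0 / probabilities) ** (1.0 / n)

# log-space version: negative mean of the log probabilities, then exponentiate
log_perplexity = -numpy.mean(numpy.log(probabilities))
via_logs = numpy.exp(log_perplexity)

print(direct, via_logs)  # both are approximately 4.0
```

Here the product of the inverse probabilities is 64 and its cube root is 4, which matches the value recovered from the log probabilities. With long sequences the direct product can underflow or overflow, which is the practical reason to stay in log space.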

We're going to use some pre-made arrays.

predictions = numpy.load(Paths.predictions)
targets = numpy.load(Paths.targets)

Now we'll cast the numpy arrays to jax.interpreters.xla.DeviceArrays.

predictions = trax_numpy.array(predictions)
targets = trax_numpy.array(targets)
print(f'predictions has shape: {predictions.shape}')
print(f'targets has shape: {targets.shape}')
predictions has shape: (32, 64, 256)
targets has shape: (32, 64)

Notice that the predictions have an extra dimension - this is the same length as the size of the vocabulary used. Because of this you will need a way of reshaping targets to match this shape. For this we will use trax.layers.one_hot.

Also note that we can get the size of the last dimension using predictions.shape[-1].

reshaped_targets = layers.one_hot(x=targets, n_categories=predictions.shape[-1])
print(f'reshaped_targets has shape: {reshaped_targets.shape}')
reshaped_targets has shape: (32, 64, 256)

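By calculating the product of the predictions and the reshaped targets and summing across the last dimension, we pick out the log probability that the model assigned to each target token. This is the per-token term of the log perplexity (the code calls it total_log_perplexity).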

total_log_perplexity = trax_numpy.sum(predictions * reshaped_targets, axis= -1)
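The multiply-and-sum with a one-hot array is just a way of selecting one entry per position. The sketch below uses plain numpy and a tiny made-up batch (one sequence, two positions, a vocabulary of three) to show that it gives the same result as indexing directly. Building the one-hot array with `numpy.eye` is a stand-in for the Trax call, not a description of how Trax implements it.

```python
import numpy

# hypothetical log probabilities with shape (batch=1, length=2, vocabulary=3)
log_probs = numpy.log(numpy.array([
    [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1]],
]))
targets = numpy.array([[0, 1]])  # the token ids that actually occurred

vocabulary_size = log_probs.shape[-1]
one_hot = numpy.eye(vocabulary_size)[targets]  # shape (1, 2, 3)

# multiply by the one-hot mask and sum over the vocabulary axis
picked_by_mask = numpy.sum(log_probs * one_hot, axis=-1)

# the same selection done by indexing along the last axis
picked_by_index = numpy.take_along_axis(
    log_probs, targets[..., None], axis=-1).squeeze(-1)

print(picked_by_mask)  # [[log 0.7, log 0.8]], roughly [[-0.3567 -0.2231]]
```

Both versions return the log probability of the target token at each position, with shape (batch, length).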

Now you will need to account for the padding so this metric is not artificially deflated (since a lower perplexity means a better model). To identify which elements are padding, you can use trax_numpy.equal() to get a tensor with True in the padding positions (where the target is 0) and False where there are actual values.

equals_zero = trax_numpy.equal(targets, 0)
print(equals_zero)
[[False False False ...  True  True  True]
 [False False False ...  True  True  True]
 [False False False ...  True  True  True]
 [False False False ...  True  True  True]
 [False False False ...  True  True  True]
 [False False False ...  True  True  True]]

equals_zero is a boolean array that has True wherever the cell had a 0 and False everywhere else. To make it numeric we can subtract the boolean array from 1 (generally in python True is treated as 1 and False as 0).

non_pad = 1.0 - equals_zero
print(f'non_pad has shape: {non_pad.shape}\n')
print(f'non_pad looks like this: \n\n {non_pad}')
non_pad has shape: (32, 64)

non_pad looks like this: 

 [[1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]]

Now if we multiply total_log_perplexity by the non_pad we'll zero-out all the entries in total_log_perplexity where non_pad has zero.

real_log_perplexity = total_log_perplexity * non_pad
print(f'real perplexity still has shape: {real_log_perplexity.shape}')
real perplexity still has shape: (32, 64)

We can check the effect of filtering out the padding by looking at the two log perplexity tensors.

print(f'log perplexity tensor before filtering padding: \n\n {total_log_perplexity}\n')
print(f'log perplexity tensor after filtering padding: \n\n {real_log_perplexity}')
log perplexity tensor before filtering padding: 

 [[ -5.396545    -1.0311184   -0.66916656 ... -22.37673    -23.18771
  -21.843483  ]
 [ -4.5857706   -1.1341286   -8.538033   ... -20.15686    -26.837097
  -23.57502   ]
 [ -5.2223887   -1.2824144   -0.17312431 ... -21.328228   -19.854412
  -33.88444   ]
 [ -5.396545   -17.291681    -4.360766   ... -20.825802   -21.065838
  -22.443115  ]
 [ -5.9313164  -14.247417    -0.2637329  ... -26.743248   -18.38433
  -22.355278  ]
 [ -5.670536    -0.10595131   0.         ... -23.332523   -28.087376
  -23.878807  ]]

log perplexity tensor after filtering padding: 

 [[ -5.396545    -1.0311184   -0.66916656 ...  -0.          -0.
   -0.        ]
 [ -4.5857706   -1.1341286   -8.538033   ...  -0.          -0.
   -0.        ]
 [ -5.2223887   -1.2824144   -0.17312431 ...  -0.          -0.
   -0.        ]
 [ -5.396545   -17.291681    -4.360766   ...  -0.          -0.
   -0.        ]
 [ -5.9313164  -14.247417    -0.2637329  ...  -0.          -0.
   -0.        ]
 [ -5.670536    -0.10595131   0.         ...  -0.          -0.
   -0.        ]]

To get a single average log perplexity across all the elements in the batch you can sum across both dimensions and divide by the number of non-padding elements. Note that the summed log probabilities are negative, so the sum is negated to get the log perplexity of the model.

log_perplexity = -trax_numpy.sum(real_log_perplexity) / trax_numpy.sum(non_pad)
print(f"log perplexity: {log_perplexity:0.4f}, "
      f"perplexity: {trax_numpy.exp(log_perplexity):0.4f}")
log perplexity: 2.3281, perplexity: 10.2586
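Putting the steps together, here is a compact numpy sketch of the whole calculation on a tiny made-up example. The function name `masked_perplexity` and the `pad_id` parameter are hypothetical; it assumes, as above, that token id 0 marks padding.

```python
import numpy


def masked_perplexity(log_probs, targets, pad_id=0):
    """Perplexity over non-padding positions, given log probabilities."""
    one_hot = numpy.eye(log_probs.shape[-1])[targets]
    token_log_probs = numpy.sum(log_probs * one_hot, axis=-1)
    non_pad = 1.0 - numpy.equal(targets, pad_id)
    log_perplexity = -numpy.sum(token_log_probs * non_pad) / numpy.sum(non_pad)
    return float(numpy.exp(log_perplexity))


# one sequence of length 3 with a vocabulary of 3; the last position is padding
log_probs = numpy.log(numpy.array([[
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
    [0.98, 0.01, 0.01],
]]))
targets = numpy.array([[1, 2, 0]])

perplexity = masked_perplexity(log_probs, targets)
print(perplexity)  # approximately 2.0
```

The two real tokens were each predicted with probability 0.5, so the perplexity is 2. Without the mask, the confident prediction at the padding position would be averaged in and pull the perplexity below 2, which is the artificial improvement the masking avoids.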

Hidden State Activation

Hidden State Activation

This is the hidden state activation function for a vanilla RNN.

\[ h^{\langle t\rangle}=g(W_{h}[h^{\langle t-1\rangle},x^{\langle t\rangle}] + b_h) \]

Which is another way of writing this:

\[ h^{\langle t\rangle}=g(W_{hh}h^{\langle t-1\rangle} \oplus W_{hx}x^{\langle t\rangle} + b_h) \]


  • \(W_{h}\) in the first formula denotes the horizontal concatenation of \(W_{hh}\) and \(W_{hx}\) from the second formula.
  • \(W_{h}\) in the first formula is then multiplied by \([h^{\langle t-1\rangle},x^{\langle t\rangle}]\), another concatenation of parameters from the second formula, but this time in a different direction, i.e. vertical.

Let us see what this means computationally.


# from pypi
import numpy



Weights: Horizontal Concatenation

A join along the vertical boundary is called a horizontal concatenation or horizontal stack.

Visually, it looks like this:- \(W_h = \left [ W_{hh} \ | \ W_{hx} \right ]\).

We'll look at two different ways to achieve this using numpy.

Note: The values used to populate the arrays, below, have been chosen to aid in visual illustration only. They are NOT what you'd expect to use building a model, which would typically be random variables instead.

First create some dummy data. The numpy.full function creates an array of a given shape whose entries all have the same value. Our first array is almost like numpy.ones, except it uses the dtype of the number you pass in, so it will be integers, not floats.

w_hh = numpy.full((3, 2), 1)
w_hx = numpy.full((3, 3), 9)

We could use some random initializations, but it would make it harder to see the joins.

print("-- Data --\n")
print("w_hh :")
print(w_hh)
print("w_hh shape :", w_hh.shape, "\n")
print("w_hx :")
print(w_hx)
print("w_hx shape :", w_hx.shape, "\n")
-- Data --

w_hh :
[[1 1]
 [1 1]
 [1 1]]
w_hh shape : (3, 2) 

w_hx :
[[9 9 9]
 [9 9 9]
 [9 9 9]]
w_hx shape : (3, 3) 
  • Option 1: concatenate - horizontal

    First we'll use numpy.concatenate.

    ROWS, COLUMNS = 0, 1
    w_h1 = numpy.concatenate((w_hh, w_hx), axis=COLUMNS)
    print("option 1 : concatenate\n")
    print("w_h :")
    print(w_h1)
    print("w_h shape :", w_h1.shape, "\n")
    option 1 : concatenate
    w_h :
    [[1 1 9 9 9]
     [1 1 9 9 9]
     [1 1 9 9 9]]
    w_h shape : (3, 5) 
  • Option 2: hstack

    Now we'll try numpy.hstack.

    w_h2 = numpy.hstack((w_hh, w_hx))
    print("option 2 : hstack\n")
    print("w_h :")
    print(w_h2)
    print("w_h shape :", w_h2.shape)
    option 2 : hstack
    w_h :
    [[1 1 9 9 9]
     [1 1 9 9 9]
     [1 1 9 9 9]]
    w_h shape : (3, 5)

    As you can see, hstack gives you the same thing as concatenate along columns. concatenate is more general, since it also lets you concatenate along rows, although hstack might be more intuitive.

Hidden State & Inputs: Vertical Concatenation

Joining along a horizontal boundary is called a vertical concatenation or vertical stack. Visually it looks like this:

\[ [h^{\langle t-1\rangle},x^{\langle t\rangle}] = \left[ \frac{h^{\langle t-1\rangle}}{x^{\langle t\rangle}} \right] \]

We'll look at two different ways to achieve this using numpy.

First create some more dummy data.

h_t_prev = numpy.full((2, 1), 1)
x_t = numpy.full((3, 1), 9)
print("-- Data --\n")
print("h_t_prev :")
print("h_t_prev shape :", h_t_prev.shape, "\n")
print("x_t :")
print("x_t shape :", x_t.shape, "\n")
-- Data --

h_t_prev :
h_t_prev shape : (2, 1) 

x_t :
x_t shape : (3, 1) 

Option 1: concatenate - Rows

ax_1 = numpy.concatenate(
    (h_t_prev, x_t), axis=ROWS
)
print("option 1 : concatenate\n")
print("ax_1 :")
print("ax_1 shape :", ax_1.shape, "\n")
option 1 : concatenate

ax_1 :
ax_1 shape : (5, 1) 

Option 2: vstack

vstack is much like hstack except instead of inserting columns it appends rows, more of what the word stack would seem to suggest.

ax_2 = numpy.vstack((h_t_prev, x_t))
print("option 2 : vstack\n")
print("ax_2 :")
print("ax_2 shape :", ax_2.shape)
option 2 : vstack

ax_2 :
ax_2 shape : (5, 1)

Verify Formulas

Now that we know how to do the concatenations, horizontal and vertical, let's verify that the two formulas produce the same result.

  • Formula 1: \(h^{\langle t\rangle}=g(W_{h}[h^{\langle t-1\rangle},x^{\langle t\rangle}] + b_h)\)
  • Formula 2: \(h^{\langle t\rangle}=g(W_{hh}h^{\langle t-1\rangle} \oplus W_{hx}x^{\langle t\rangle} + b_h)\)

We want to assure ourselves that Formula 1 \(\Leftrightarrow\) Formula 2.

We will initially ignore the bias term \(b_h\) and the activation function g( ) because the transformation will be identical for each formula. So what we really want to compare is the result of the following parameters inside each formula:

\[ W_{h}[h^{\langle t-1\rangle},x^{\langle t\rangle}] \quad \Leftrightarrow \quad W_{hh}h^{\langle t-1\rangle} \oplus W_{hx}x^{\langle t\rangle} \]

We'll see how to do this using matrix multiplication combined with the data and techniques (stacking/concatenating) from above.

The Data

w_hh = numpy.full((3, 2), 1)
w_hx = numpy.full((3, 3), 9)
h_t_prev = numpy.full((2, 1), 1)
x_t = numpy.full((3, 1), 9)

Formula 1

stack_1 = numpy.hstack((w_hh, w_hx))
stack_2 = numpy.vstack((h_t_prev, x_t))
print("\nFormula 1")
formula_1 = numpy.matmul(stack_1, stack_2)

Formula 1
 [[1 1 9 9 9]
 [1 1 9 9 9]
 [1 1 9 9 9]]

Formula 2

term_1 = numpy.matmul(w_hh, h_t_prev)
term_2 = numpy.matmul(w_hx, x_t)
print("\nFormula 2")
print("Term1:\n", term_1)
print("Term2:\n", term_2)

formula_2 = term_1 + term_2
print(formula_2, "\n")

Formula 2



numpy.allclose checks that each entry in one array is within a certain tolerance of the corresponding entry in another. For this example we're using integers, so you could probably use all(a == b), but when you have floats it's better to use allclose, since float arithmetic won't always be exact.
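A classic example of why exact comparison is fragile with floats (a standalone sketch, separate from the arrays used in this post):

```python
import numpy

a = numpy.array([0.1 + 0.2])
b = numpy.array([0.3])

exact = bool((a == b).all())         # exact equality fails because of rounding
close = bool(numpy.allclose(a, b))   # comparison within a tolerance succeeds

print(exact)  # False
print(close)  # True
```

In binary floating point 0.1 + 0.2 is stored as 0.30000000000000004, so the exact test fails even though the two values agree to many decimal places.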

print("-- Verify --")
print("Results are the same :", numpy.allclose(formula_1, formula_2))
print(f"Also the same: {all(formula_1==formula_2)}")
-- Verify --
Results are the same : True
Also the same: True
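The integer dummy data makes the joins easy to see, but the identity does not depend on it. As a further check, this sketch repeats the comparison with random floats (seeded so it is repeatable). The variable names mirror the ones above with a `_r` suffix added here to avoid overwriting them.

```python
import numpy

rng = numpy.random.default_rng(42)
w_hh_r = rng.standard_normal((3, 2))
w_hx_r = rng.standard_normal((3, 3))
h_prev_r = rng.standard_normal((2, 1))
x_t_r = rng.standard_normal((3, 1))

# Formula 1: concatenate first, then a single matrix product
joined = numpy.hstack((w_hh_r, w_hx_r)) @ numpy.vstack((h_prev_r, x_t_r))

# Formula 2: two separate products, then add
split = w_hh_r @ h_prev_r + w_hx_r @ x_t_r

print(numpy.allclose(joined, split))  # True
```

The two results can differ in the last few bits because the additions happen in a different order, which is why allclose is the right comparison here.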

Now we'll add a sigmoid activation function and bias term as a final check so we can see how this would work in action.

def sigmoid(x: numpy.ndarray) -> numpy.ndarray:
    """Calculates the sigmoid of x

    Args:
     x: numpy array or list or float

    Returns:
     the sigmoid of x
    """
    return 1 / (1 + numpy.exp(-x))

bias = numpy.random.standard_normal((formula_1.shape[0], 1))
print("Formula 1 Output:\n", sigmoid(formula_1 + bias))
print("Formula 2 Output:\n", sigmoid(formula_2 + bias))

assert numpy.allclose(sigmoid(formula_1 + bias), sigmoid(formula_2 + bias))
Formula 1 Output:
Formula 2 Output:

Tensorflow Docker Setup

I recently re-started using tensorflow and the python interpreter kept crashing. It appears that they compiled the latest version to require AVX2, and the server I was using has AVX but not AVX2. I couldn't find any documentation about this requirement, but running the code on a different machine that has both AVX and AVX2 got rid of the problem. This might be a transient problem, as the nightly build doesn't crash on either machine, but trying to run the nightly build with other code is a nightmare: it seems that every framework related to tensorflow tries to revert the version back to the broken one, so I gave up and changed machines.

The process of setting up cuda and tensorflow over and over again proved difficult, as there are different ways to do it (through apt, using nvidia installers, building from source) and each presents a different problem. The version apt installs, for instance, puts the folders in places tensorflow can't find (if you build tensorflow from source), and using the nvidia debian package for cudnn left my packages in a broken state, as it was trying to install something that then broke another package's requirements… Anyway, I'm going to try to avoid building tensorflow from source and run everything from docker containers.

Setting Up

I don't know for sure that this is necessary, but I followed nvidia's docker installation instructions. If nothing else you can use it to check that the setup works. After that I set up tensorflow's container with a Dockerfile:

FROM tensorflow/tensorflow:latest-gpu-py3-jupyter
RUN apt-get update && \
        apt-get install openssh-server --yes && \
        echo "Adding neurotic user" && \
        useradd --create-home --shell /bin/bash neurotic
COPY authorized_keys /home/neurotic/.ssh/
ENTRYPOINT service ssh restart && bash

The latest tensorflow container comes with python 2.7 as the default for some reason, and all the dependencies are installed with it in mind, so to get python 3 (3.6 as of now) you need to specify the py3 tag like I did in the FROM line. Additionally, I use ssh-forwarding for jupyter kernels so I can work in emacs with them, so I installed the ssh-server and also created a non-root user to run jupyter. The last line, ENTRYPOINT service ssh restart && bash, makes sure the ssh-server is running and opens up a bash shell. To build the container I used this command:

docker build -t neurotic-tensorflow .

This creates an image named neurotic-tensorflow. To run it I use this command:

docker run --gpus all -p 2222:22 --name data-neurotic \
       --mount type=bind,source=$HOME/projects/neurotic-networks,target=/home/neurotic/neurotic-networks \
       --mount type=bind,source=/media/data,target=/home/neurotic/data \
       -it neurotic-tensorflow bash

The --gpus all makes the GPUs available. The -p 2222:22 flag maps the ssh-server in the container to port 2222 on the host. This allows you to ssh into the container using ssh neurotic@localhost -p 2222 without knowing the IP address of the container. You can also grab the IP address and then ssh into it like it's another machine on the network:

docker inspect --format "{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}" data-neurotic

Here data-neurotic is the name given to the container in the docker run command. The advantages of the port mapping are that:

  • You don't need to know the address of the container if you are on the host machine.
  • You can ssh into the container from another machine by substituting the host's IP address for localhost in the ssh command

The mount options mount some folders into the container so we can share files.

Once you've run it you can restart it at any time using:

docker start data-neurotic

And if you need to run something as root you can attach to the running container.

docker attach data-neurotic

NOTE: The python 3 container has cuda 10.1 installed but the latest version of tensorflow expects 11.0 - and tensorflow seems to use hard-coded names. So to make it work you either have to upgrade cuda or symlink the file and rename it to look like the newer version.

ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/ /usr/lib/x86_64-linux-gnu/

Tensorflow dependencies are incredibly convoluted and broken all over the place.

Sentiment Analysis: Testing the Model


Having trained our Deep Learning model for Sentiment Analysis previously we're now going to test how well it did.


# python
from argparse import Namespace
from functools import partial
from pathlib import Path

# pypi
import nltk
import trax.fastmath.numpy as numpy
import trax.layers as trax_layers

# this project
from neurotic.nlp.twitter.sentiment_network import SentimentNetwork
from neurotic.nlp.twitter.tensor_generator import TensorBuilder, TensorGenerator

Set Up


I download the data here because all the trouble getting trax and tensorflow working with CUDA means I have to keep re-building the Docker container I'm using.

data_path = Path("~/data/datasets/nltk_data/").expanduser()
nltk.download("twitter_samples", download_dir=str(data_path))

The Data Generators

converter = TensorBuilder()
train_generator = partial(TensorGenerator, converter,

VALIDATION_GENERATOR = valid_generator()
SIZE_OF_VOCABULARY = len(converter.vocabulary)

OUTPUT_PATH = Path("~/models").expanduser()
if not OUTPUT_PATH.is_dir():

The Model Builder

trainer = SentimentNetwork(
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

Step    110: Ran 10 train steps in 4.89 secs
Step    110: train CrossEntropyLoss |  0.00662578
Step    110: eval  CrossEntropyLoss |  0.00139236
Step    110: eval          Accuracy |  1.00000000

Step    120: Ran 10 train steps in 2.61 secs
Step    120: train CrossEntropyLoss |  0.03323080
Step    120: eval  CrossEntropyLoss |  0.00684100
Step    120: eval          Accuracy |  1.00000000

Step    130: Ran 10 train steps in 1.27 secs
Step    130: train CrossEntropyLoss |  0.11124543
Step    130: eval  CrossEntropyLoss |  0.00011413
Step    130: eval          Accuracy |  1.00000000

Step    140: Ran 10 train steps in 0.71 secs
Step    140: train CrossEntropyLoss |  0.03609489
Step    140: eval  CrossEntropyLoss |  0.00000590
Step    140: eval          Accuracy |  1.00000000

Step    150: Ran 10 train steps in 1.92 secs
Step    150: train CrossEntropyLoss |  0.08605278
Step    150: eval  CrossEntropyLoss |  0.00003427
Step    150: eval          Accuracy |  1.00000000

Step    160: Ran 10 train steps in 1.31 secs
Step    160: train CrossEntropyLoss |  0.04926774
Step    160: eval  CrossEntropyLoss |  0.00003597
Step    160: eval          Accuracy |  1.00000000

Step    170: Ran 10 train steps in 1.30 secs
Step    170: train CrossEntropyLoss |  0.00986138
Step    170: eval  CrossEntropyLoss |  0.00026259
Step    170: eval          Accuracy |  1.00000000

Step    180: Ran 10 train steps in 0.76 secs
Step    180: train CrossEntropyLoss |  0.00773767
Step    180: eval  CrossEntropyLoss |  0.00038017
Step    180: eval          Accuracy |  1.00000000

Step    190: Ran 10 train steps in 1.35 secs
Step    190: train CrossEntropyLoss |  0.00555876
Step    190: eval  CrossEntropyLoss |  0.00000706
Step    190: eval          Accuracy |  1.00000000

Step    200: Ran 10 train steps in 0.76 secs
Step    200: train CrossEntropyLoss |  0.00381955
Step    200: eval  CrossEntropyLoss |  0.00000122
Step    200: eval          Accuracy |  1.00000000

The Accuracy

This is from the last post. I haven't figured out how to arrange all the code yet.

def compute_accuracy(preds: numpy.ndarray,
                     y: numpy.ndarray,
                     y_weights: numpy.ndarray) -> tuple:
    """Compute a batch accuracy

    Args:
       preds: a tensor of shape (dim_batch, output_dim) 
       y: a tensor of shape (dim_batch,) with the true labels
       y_weights: an ndarray with a weight for each example

    Returns:
       accuracy: a float between 0-1 
       weighted_num_correct (np.float32): Sum of the weighted correct predictions
       sum_weights (np.float32): Sum of the weights
    """
    # Create an array of booleans, 
    # True if the probability of positive sentiment is greater than
    # the probability of negative sentiment
    # else False
    is_pos =  preds[:, 1] > preds[:, 0]

    # convert the array of booleans into an array of np.int32
    is_pos_int = is_pos.astype(numpy.int32)

    # compare the array of predictions (as int32) with the target (labels) of type int32
    correct = is_pos_int == y

    # Count the sum of the weights.
    sum_weights = y_weights.sum()

    # convert the array of correct predictions (boolean) into an array of np.float32
    correct_float = correct.astype(numpy.float32)

    # Multiply each prediction with its corresponding weight.
    weighted_correct_float = correct_float * y_weights

    # Sum up the weighted correct predictions (of type np.float32), to go in the
    # numerator.
    weighted_num_correct = weighted_correct_float.sum()

    # Divide the number of weighted correct predictions by the sum of the
    # weights.
    accuracy = weighted_num_correct/sum_weights

    return accuracy, weighted_num_correct, sum_weights


Testing the model on Validation Data

Now we'll test our model's prediction accuracy on validation data.

The function will take in a data generator and the model.

  • The generator allows us to get batches of data. You can use it with a for loop:
for batch in iterator: 
   # do something with that batch

Each batch is a tuple of (X, Y, weights).

  • Position 0 holds the tweets as tensors (the inputs).
  • Position 1 holds the targets (the actual labels, positive or negative sentiment).
  • Position 2 holds the example weights.
  • You can feed the tweets into the model and it will return the predictions for the batch.
def test_model(generator: TensorGenerator, model: trax_layers.Serial) -> float:
    """Calculate the accuracy of the model

    Args:
       generator: an iterator instance that provides batches of inputs and targets
       model: a model instance 

    Returns:
       accuracy: float corresponding to the accuracy
    """
    accuracy = 0.
    total_num_correct = 0
    total_num_pred = 0

    for batch in generator: 

        # Retrieve the inputs from the batch
        inputs = batch[0]

        # Retrieve the targets (actual labels) from the batch
        targets = batch[1]

        # Retrieve the example weight.
        example_weight = batch[2]

        # Make predictions using the inputs
        pred = model(inputs)

        # Calculate accuracy for the batch by comparing its predictions and targets
        batch_accuracy, batch_num_correct, batch_num_pred = compute_accuracy(
            pred, targets, example_weight)

        # Update the total number of correct predictions
        # by adding the number of correct predictions from this batch
        total_num_correct += batch_num_correct

        # Update the total number of predictions 
        # by adding the number of predictions made for the batch
        total_num_pred += batch_num_pred

    # Calculate accuracy over all examples
    accuracy = total_num_correct/total_num_pred

    return accuracy
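
To see why test_model adds up the weighted counts from every batch instead of averaging the per-batch accuracies, here is a small sketch with made-up batch results (the numbers are hypothetical, not output from the model). When the last batch is smaller than the others, the two calculations disagree.

```python
# Hypothetical per-batch results as (weighted_num_correct, sum_weights).
# The last batch is smaller, as can happen at the end of a data set.
batch_results = [(60.0, 64.0), (64.0, 64.0), (1.0, 2.0)]

# Pooled accuracy: total correct divided by total predictions,
# which is what test_model computes.
total_correct = sum(correct for correct, _ in batch_results)
total_predictions = sum(count for _, count in batch_results)
pooled_accuracy = total_correct / total_predictions

# Mean of the per-batch accuracies: the tiny last batch counts as much
# as the full batches, which distorts the result.
batch_accuracies = [correct / count for correct, count in batch_results]
mean_of_batches = sum(batch_accuracies) / len(batch_accuracies)

print(f"pooled: {pooled_accuracy:.4f}, mean of batches: {mean_of_batches:.4f}")
```

The pooled value weights every example equally, so it doesn't matter how the examples happen to be split into batches.
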
# testing the accuracy of your model: this takes around 20 seconds
model = trainer.training_loop.eval_model

# we used all the data for the training and validation (oops)
# so we don't have any test data. Fix that later
#accuracy = test_model(VALIDATION_GENERATOR, model)
generator = valid_generator(infinite=False)
accuracy = test_model(generator, model)
print(f'The accuracy of your model on the validation set is {accuracy:.4f}', )
The accuracy of your model on the validation set is 0.9995

Testing Some Custom Input

Finally, let's test some custom input. You will see that deepnets are more powerful than the older methods we have used before. Although we got close to 100% accuracy using Naive Bayes and Logistic Regression, that was because the task was way easier.

This is used to predict on a new sentence.

def predict(sentence: str) -> tuple:
    """Predicts the sentiment of the sentence

    Args:
     sentence: sentence to get the sentiment for

    Returns:
     predictions, sentiment
    """
    inputs = numpy.array(converter.to_tensor(sentence))

    # Batch size 1, add dimension for batch, to work with the model
    inputs = inputs.reshape(1, len(inputs))

    # predict with the model
    probabilities = model(inputs)

    # Turn probabilities into categories
    prediction = int(probabilities[0, 1] > probabilities[0, 0])

    sentiment = "positive" if prediction == 1 else "negative"

    return prediction, sentiment
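
The reshape in predict is there because the model expects a batch dimension even for a single sentence. A minimal sketch of that step, using plain numpy and made-up token ids (in the post converter.to_tensor supplies the real ones):

```python
import numpy

# Hypothetical token ids for one tokenized sentence.
inputs = numpy.array([3, 17, 42, 8])

# Add a leading batch dimension of size 1, as predict does.
batched = inputs.reshape(1, len(inputs))

# Indexing with None (numpy.newaxis) is an equivalent spelling.
also_batched = inputs[None, :]

print(inputs.shape, batched.shape)
```
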
sentence = "It's such a nice day, think i'll be taking Sid to Ramsgate fish and chips for lunch at Peter's fish factory and then the beach maybe"
inputs = numpy.array(converter.to_tensor(sentence))

A Positive Sentence

sentence = "It's such a nice day, think i'll be taking Sid to Ramsgate fish and chips for lunch at Peter's fish factory and then the beach maybe"
tmp_pred, tmp_sentiment = predict(sentence)
print(f"The sentiment of the sentence \n***\n\"{sentence}\"\n***\nis {tmp_sentiment}.")
The sentiment of the sentence 
"It's such a nice day, think i'll be taking Sid to Ramsgate fish and chips for lunch at Peter's fish factory and then the beach maybe"
is positive.

A Negative Sentence

sentence = "I hated my day, it was the worst, I'm so sad."
tmp_pred, tmp_sentiment = predict(sentence)
print(f"The sentiment of the sentence \n***\n\"{sentence}\"\n***\nis {tmp_sentiment}.")
The sentiment of the sentence 
"I hated my day, it was the worst, I'm so sad."
is negative.

Notice that the model works well even for complex sentences.

On Pooh

s = "Oh, bother!"
print(f"{s}: {predict(s)}")
Oh, bother!: (0, 'negative')

On Deep Nets

Deep nets allow you to understand and capture dependencies that you would not have been able to capture with a simple linear regression or logistic regression.

  • It also allows you to better use pre-trained embeddings for classification and tends to generalize better.


So, there you have it, a Deep Learning Model for Sentiment Analysis built using Trax. Here are the prior posts in this series.

Sentiment Analysis: Training the Model

Training the Model

In the previous post we defined our Deep Learning model for Sentiment Analysis. Now we'll turn to training it on our data.

To train a model on a task, Trax defines an abstraction, trax.supervised.training.TrainTask, which packages the training data, loss and optimizer (among other things) together into an object.

Similarly, to evaluate a model, Trax defines an abstraction, trax.supervised.training.EvalTask, which packages the eval data and metrics (among other things) into another object.

The final piece tying things together is the trax.supervised.training.Loop abstraction, a very simple and flexible way to put everything together and train the model, all the while evaluating it and saving checkpoints. Using Loop will save you a lot of code compared to always writing the training loop by hand, like you did in courses 1 and 2. More importantly, you are less likely to have a bug in that code that would ruin your training.
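
To get a feel for the bookkeeping that a training loop takes care of, here is a toy stand-in in plain Python. It is not Trax's implementation: toy_loop, its arguments and the one-weight "model" are all made up for illustration. It runs the training steps and evaluates every few steps.

```python
def toy_loop(n_steps: int, steps_per_checkpoint: int,
             learning_rate: float = 0.1) -> tuple:
    """Fits one weight to the target 3.0, evaluating at each checkpoint"""
    weight, checkpoints = 0.0, []
    for step in range(1, n_steps + 1):
        # the "train task": one gradient step on the loss (weight - 3)^2
        gradient = 2 * (weight - 3.0)
        weight -= learning_rate * gradient
        # the "eval task": measure the loss every few steps
        if step % steps_per_checkpoint == 0:
            checkpoints.append((step, (weight - 3.0) ** 2))
    return weight, checkpoints

weight, checkpoints = toy_loop(n_steps=30, steps_per_checkpoint=10)
for step, loss in checkpoints:
    print(f"Step {step:>4}: eval loss | {loss:.8f}")
```

Even in this tiny version the stepping, the evaluation schedule and the record keeping are tangled together in one function, which is the part that is easy to get wrong when it's written by hand for every model.
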


# from python
from functools import partial
from pathlib import Path

import random

# from pypi
from trax.supervised import training

import nltk
import trax
import trax.layers as trax_layers
import trax.fastmath.numpy as numpy

# this project
from neurotic.nlp.twitter.tensor_generator import TensorBuilder, TensorGenerator

The Dataset


converter = TensorBuilder()

train_generator = partial(TensorGenerator, converter,
training_generator = train_generator()

valid_generator = partial(TensorGenerator,
validation_generator = valid_generator()

size_of_vocabulary = len(converter.vocabulary)

Here's the Model

This was defined in the last post. It seems like too much trouble not to just copy it over.

def classifier(vocab_size: int=size_of_vocabulary,
               embedding_dim: int=256,
               output_dim: int=2) -> trax_layers.Serial:
    """Creates the classifier model

    Args:
     vocab_size: number of tokens in the training vocabulary
     embedding_dim: output dimension for the Embedding layer
     output_dim: dimension for the Dense layer

    Returns:
     the composed layer-model
    """
    embed_layer = trax_layers.Embedding(
        vocab_size=vocab_size, # Size of the vocabulary
        d_feature=embedding_dim)  # Embedding dimension

    mean_layer = trax_layers.Mean(axis=1)

    dense_output_layer = trax_layers.Dense(n_units = output_dim)

    log_softmax_layer = trax_layers.LogSoftmax()

    # Compose the layers in the order they were defined
    model = trax_layers.Serial(
        embed_layer,
        mean_layer,
        dense_output_layer,
        log_softmax_layer,
    )
    return model

Now to train the model.

First define the TrainTask, EvalTask and Loop in preparation for training the model.


# train_generator(batch_size=batch_size, shuffle=True),

train_task = training.TrainTask(

eval_task = training.EvalTask(
    metrics=[trax_layers.CrossEntropyLoss(), trax_layers.Accuracy()],

model = classifier()

This defines a model trained using trax_layers.CrossEntropyLoss optimized with the trax.optimizers.Adam optimizer, all the while tracking the accuracy using the trax_layers.Accuracy metric. We also track trax_layers.CrossEntropyLoss on the validation set.

Now let's make an output directory and train the model.

output_path = Path("~/models/").expanduser()
if not output_path.is_dir():
def train_model(classifier, train_task, eval_task, n_steps, output_dir):
    """Create and run the training loop

    Args:
       classifier - the model you are building
       train_task - Training task
       eval_task - Evaluation task
       n_steps - the number of training steps to run
       output_dir - folder to save your files

    Returns:
       trainer -  trax trainer
    """
    training_loop = training.Loop(
                                model=classifier, # The learning model
                                tasks=train_task, # The training task
                                eval_tasks = eval_task, # The evaluation task
                                output_dir = output_dir) # The output directory

    training_loop.run(n_steps = n_steps)
    # Return the training_loop, since it has the model.
    return training_loop
training_loop = train_model(model, train_task, eval_task, 100, output_path)

Step    110: Ran 10 train steps in 6.06 secs
Step    110: train CrossEntropyLoss |  0.00527583
Step    110: eval  CrossEntropyLoss |  0.00304692
Step    110: eval          Accuracy |  1.00000000

Step    120: Ran 10 train steps in 2.06 secs
Step    120: train CrossEntropyLoss |  0.02130376
Step    120: eval  CrossEntropyLoss |  0.00000677
Step    120: eval          Accuracy |  1.00000000

Step    130: Ran 10 train steps in 0.75 secs
Step    130: train CrossEntropyLoss |  0.01026674
Step    130: eval  CrossEntropyLoss |  0.00424393
Step    130: eval          Accuracy |  1.00000000

Step    140: Ran 10 train steps in 1.33 secs
Step    140: train CrossEntropyLoss |  0.00172522
Step    140: eval  CrossEntropyLoss |  0.00004072
Step    140: eval          Accuracy |  1.00000000

Step    150: Ran 10 train steps in 0.77 secs
Step    150: train CrossEntropyLoss |  0.00002847
Step    150: eval  CrossEntropyLoss |  0.00000232
Step    150: eval          Accuracy |  1.00000000

Step    160: Ran 10 train steps in 0.78 secs
Step    160: train CrossEntropyLoss |  0.00002123
Step    160: eval  CrossEntropyLoss |  0.00104654
Step    160: eval          Accuracy |  1.00000000

Step    170: Ran 10 train steps in 0.79 secs
Step    170: train CrossEntropyLoss |  0.00001706
Step    170: eval  CrossEntropyLoss |  0.00000080
Step    170: eval          Accuracy |  1.00000000

Step    180: Ran 10 train steps in 0.83 secs
Step    180: train CrossEntropyLoss |  0.00001554
Step    180: eval  CrossEntropyLoss |  0.00000989
Step    180: eval          Accuracy |  1.00000000

Step    190: Ran 10 train steps in 0.85 secs
Step    190: train CrossEntropyLoss |  0.00639312
Step    190: eval  CrossEntropyLoss |  0.00255337
Step    190: eval          Accuracy |  1.00000000

Step    200: Ran 10 train steps in 0.85 secs
Step    200: train CrossEntropyLoss |  0.00124322
Step    200: eval  CrossEntropyLoss |  0.02190475
Step    200: eval          Accuracy |  1.00000000

Bundle It Up









# python
from pathlib import Path

# from pypi
from trax.supervised import training

import attr
import trax
import trax.layers as trax_layers

The Trainer

class SentimentNetwork:
    """Builds and Trains the Sentiment Analysis Model

    Args:
     training_generator: generator of training batches
     validation_generator: generator of validation batches
     vocabulary_size: number of tokens in the training vocabulary
     training_loops: number of times to run the training loop
     output_path: path to where to store the model
     embedding_dimension: output dimension for the Embedding layer
     output_dimension: dimension for the Dense layer
    """
    vocabulary_size: int
    training_generator: object
    validation_generator: object
    training_loops: int
    output_path: Path
    embedding_dimension: int=256
    output_dimension: int=2
    _model: trax_layers.Serial=None
    _training_task: training.TrainTask=None
    _evaluation_task: training.EvalTask=None
    _training_loop: training.Loop=None
  • The Model
    def model(self) -> trax_layers.Serial:
        """The Embeddings model"""
        if self._model is None:
            self._model = trax_layers.Serial(
        return self._model
  • The Training Task
    def training_task(self) -> training.TrainTask:
        """The training task for training the model"""
        if self._training_task is None:
            self._training_task = training.TrainTask(
        return self._training_task
  • Evaluation Task
    def evaluation_task(self) -> training.EvalTask:
        """The validation evaluation task"""
        if self._evaluation_task is None:
            self._evaluation_task = training.EvalTask(
        return self._evaluation_task
  • Training Loop
    def training_loop(self) -> training.Loop:
        """The thing to run the training"""
        if self._training_loop is None:
            self._training_loop = training.Loop(
                output_dir= self.output_path) 
        return self._training_loop
  • Fitting the Model
    def fit(self):
        """Runs the training loop"""

Practice In Making Predictions

Now that you have trained a model, you can access it as the training_loop.model object. We will actually use training_loop.eval_model; in the coming weeks you will learn why we sometimes use a different model for evaluation, e.g., one without dropout. For now, make predictions with your model.

Use the training data just to see how the prediction process works.

  • Later, you will use validation data to evaluate your model's performance.

Create a generator object.

tmp_train_generator = train_generator(batch_size=16)

Get one batch.

tmp_batch = next(tmp_train_generator)

Position 0 has the model inputs (tweets as tensors), position 1 has the targets (the actual labels) and position 2 has the example weights.

tmp_inputs, tmp_targets, tmp_example_weights = tmp_batch

print(f"The batch is a tuple of length {len(tmp_batch)}: position 0 contains the tweets, position 1 contains the targets and position 2 contains the example weights.") 
print(f"The shape of the tweet tensors is {tmp_inputs.shape} (num of examples, length of tweet tensors)")
print(f"The shape of the labels is {tmp_targets.shape}, which is the batch size.")
print(f"The shape of the example_weights is {tmp_example_weights.shape}, which is the same as inputs/targets size.")
The batch is a tuple of length 3: position 0 contains the tweets, position 1 contains the targets and position 2 contains the example weights.
The shape of the tweet tensors is (16, 14) (num of examples, length of tweet tensors)
The shape of the labels is (16,), which is the batch size.
The shape of the example_weights is (16,), which is the same as inputs/targets size.

Feed the tweet tensors into the model to get a prediction.

tmp_pred = training_loop.eval_model(tmp_inputs)
print(f"The prediction shape is {tmp_pred.shape}, num of tensor_tweets as rows")
print("Column 0 is the log probability of a negative sentiment (class 0)")
print("Column 1 is the log probability of a positive sentiment (class 1)")
print()
print("View the prediction array")
print(tmp_pred)
The prediction shape is (16, 2), num of tensor_tweets as rows
Column 0 is the log probability of a negative sentiment (class 0)
Column 1 is the log probability of a positive sentiment (class 1)

View the prediction array
[[-1.2960873e+01 -2.3841858e-06]
 [-5.6474457e+00 -3.5326481e-03]
 [-5.3460855e+00 -4.7781467e-03]
 [-7.6736917e+00 -4.6515465e-04]
 [-5.2682662e+00 -5.1658154e-03]
 [-1.0566207e+01 -2.5749207e-05]
 [-5.6388092e+00 -3.5634041e-03]
 [-3.9540453e+00 -1.9363165e-02]
 [ 0.0000000e+00 -2.0700916e+01]
 [ 0.0000000e+00 -2.2949795e+01]
 [ 0.0000000e+00 -2.3168846e+01]
 [ 0.0000000e+00 -2.4553205e+01]
 [-9.5367432e-07 -1.3878939e+01]
 [ 0.0000000e+00 -1.6655178e+01]
 [ 0.0000000e+00 -1.5975946e+01]
 [ 0.0000000e+00 -2.0577690e+01]]

To turn these log probabilities into categories (negative or positive sentiment prediction), for each row:

  • Compare the values in the two columns.
  • If column 1 has a value greater than column 0, classify that as a positive tweet.
  • Otherwise if column 1 is less than or equal to column 0, classify that example as a negative tweet.
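
The model ends with a LogSoftmax layer, so the values in the array are log probabilities rather than probabilities. A small sketch, using plain numpy (standing in for trax.fastmath.numpy) and two rows copied from the array above, suggests why comparing them directly is fine: exponentiating gives probabilities that sum to about 1 in each row, and because the exponential is monotonic the comparison picks the same class either way.

```python
import numpy

# Two rows of log probabilities copied from the prediction array:
# column 0 is negative sentiment, column 1 is positive sentiment.
log_probabilities = numpy.array([[-5.6474457, -3.5326481e-03],
                                 [-9.5367432e-07, -1.3878939e+01]])

# Exponentiating recovers probabilities; each row sums to (about) 1.
probabilities = numpy.exp(log_probabilities)
print(probabilities.sum(axis=1))

# exp is monotonic, so both comparisons pick the same class.
from_logs = log_probabilities[:, 1] > log_probabilities[:, 0]
from_probabilities = probabilities[:, 1] > probabilities[:, 0]
print(from_logs, from_probabilities)
```
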

Turn probabilities into category predictions.

tmp_is_positive = tmp_pred[:,1] > tmp_pred[:,0]
for i, p in enumerate(tmp_is_positive):
    print(f"Neg log prob {tmp_pred[i,0]:.4f}\tPos log prob {tmp_pred[i,1]:.4f}\t is positive? {p}\t actual {tmp_targets[i]}")
Neg log prob -12.9609   Pos log prob -0.0000     is positive? True       actual 1
Neg log prob -5.6474    Pos log prob -0.0035     is positive? True       actual 1
Neg log prob -5.3461    Pos log prob -0.0048     is positive? True       actual 1
Neg log prob -7.6737    Pos log prob -0.0005     is positive? True       actual 1
Neg log prob -5.2683    Pos log prob -0.0052     is positive? True       actual 1
Neg log prob -10.5662   Pos log prob -0.0000     is positive? True       actual 1
Neg log prob -5.6388    Pos log prob -0.0036     is positive? True       actual 1
Neg log prob -3.9540    Pos log prob -0.0194     is positive? True       actual 1
Neg log prob 0.0000     Pos log prob -20.7009    is positive? False      actual 0
Neg log prob 0.0000     Pos log prob -22.9498    is positive? False      actual 0
Neg log prob 0.0000     Pos log prob -23.1688    is positive? False      actual 0
Neg log prob 0.0000     Pos log prob -24.5532    is positive? False      actual 0
Neg log prob -0.0000    Pos log prob -13.8789    is positive? False      actual 0
Neg log prob 0.0000     Pos log prob -16.6552    is positive? False      actual 0
Neg log prob 0.0000     Pos log prob -15.9759    is positive? False      actual 0
Neg log prob 0.0000     Pos log prob -20.5777    is positive? False      actual 0

Notice that since you are making a prediction using a training batch, it's more likely that the model's predictions match the actual targets (labels).

  • Every prediction that the tweet is positive matches the actual target of 1 (positive sentiment).
  • Similarly, all predictions that the sentiment is not positive match the actual target of 0 (negative sentiment).

One more useful thing to know is how to check whether the prediction matches the actual target (label).

  • The result of calculating is_positive is an array of booleans.
  • The target is of type trax.fastmath.numpy.int32.
  • If you expect to be doing division, you may prefer to work with decimal numbers with the data type trax.fastmath.numpy.float32.

View the array of booleans.

print("Array of booleans")
Array of booleans
DeviceArray([ True,  True,  True,  True,  True,  True,  True,  True,
             False, False, False, False, False, False, False, False],            dtype=bool)

Convert booleans to type int32.

  • True is converted to 1
  • False is converted to 0
tmp_is_positive_int = tmp_is_positive.astype(trax.fastmath.numpy.int32)

View the array of integers.

print("Array of integers")
Array of integers
DeviceArray([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

Convert boolean to type float32.

tmp_is_positive_float = tmp_is_positive.astype(numpy.float32)

View the array of floats.

print("Array of floats")
Array of floats
DeviceArray([1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
             0.], dtype=float32)
(16, 2)

Note that Python usually does type conversion for you when you compare a boolean to an integer.

  • True compared to 1 is True; compared to any other integer it is False.
  • False compared to 0 is True; compared to any other integer it is False.
print(f"True == 1: {True == 1}")
print(f"True == 2: {True == 2}")
print(f"False == 0: {False == 0}")
print(f"False == 2: {False == 2}")
True == 1: True
True == 2: False
False == 0: True
False == 2: False

However, we recommend that you keep track of the data type of your variables to avoid unexpected outcomes. So it helps to convert the booleans into integers.

Compare 1 to 1 rather than comparing True to 1.

Hopefully you are now familiar with what kinds of inputs and outputs the model uses when making a prediction.

  • This will help you implement a function that estimates the accuracy of the model's predictions.


Computing the accuracy of a batch

You will now write a function that evaluates your model on the validation set and returns the accuracy.

  • preds contains the predictions.
  • Its dimensions are (batch_size, output_dim). output_dim is two in this case. Column 0 contains the probability that the tweet belongs to class 0 (negative sentiment). Column 1 contains probability that it belongs to class 1 (positive sentiment).
  • If the probability in column 1 is greater than the probability in column 0, then interpret this as the model's prediction that the example has label 1 (positive sentiment).
  • Otherwise, if the probabilities are equal or the probability in column 0 is higher, the model's prediction is 0 (negative sentiment).
  • y contains the actual labels.
  • y_weights contains the weights to give to predictions.
def compute_accuracy(preds: numpy.ndarray,
                     y: numpy.ndarray,
                     y_weights: numpy.ndarray) -> tuple:
    """Compute a batch accuracy

    Args:
       preds: a tensor of shape (dim_batch, output_dim) 
       y: a tensor of shape (dim_batch,) with the true labels
       y_weights: an ndarray with a weight for each example

    Returns:
       accuracy: a float between 0-1 
       weighted_num_correct (np.float32): Sum of the weighted correct predictions
       sum_weights (np.float32): Sum of the weights
    """
    # Create an array of booleans, 
    # True if the probability of positive sentiment is greater than
    # the probability of negative sentiment
    # else False
    is_pos =  preds[:, 1] > preds[:, 0]

    # convert the array of booleans into an array of np.int32
    is_pos_int = is_pos.astype(numpy.int32)

    # compare the array of predictions (as int32) with the target (labels) of type int32
    correct = is_pos_int == y

    # Count the sum of the weights.
    sum_weights = y_weights.sum()

    # convert the array of correct predictions (boolean) into an array of np.float32
    correct_float = correct.astype(numpy.float32)

    # Multiply each prediction with its corresponding weight.
    weighted_correct_float = correct_float * y_weights

    # Sum up the weighted correct predictions (of type np.float32), to go in the
    # numerator.
    weighted_num_correct = weighted_correct_float.sum()

    # Divide the number of weighted correct predictions by the sum of the
    # weights.
    accuracy = weighted_num_correct/sum_weights

    return accuracy, weighted_num_correct, sum_weights
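
Here is a small worked example of the same steps with made-up values. It uses plain numpy rather than trax.fastmath.numpy, and weighted_accuracy is a condensed stand-in for the function above, not a replacement for it. The tie shows that equal values count as a negative prediction, and the zero weight shows how an example can be dropped from the calculation.

```python
import numpy

def weighted_accuracy(preds, y, y_weights):
    """Weighted accuracy, following the same steps as compute_accuracy"""
    predicted = (preds[:, 1] > preds[:, 0]).astype(numpy.int32)
    correct = (predicted == y).astype(numpy.float32)
    weighted_num_correct = (correct * y_weights).sum()
    sum_weights = y_weights.sum()
    return weighted_num_correct / sum_weights, weighted_num_correct, sum_weights

# Hypothetical log probabilities for four tweets (columns: negative, positive).
preds = numpy.array([[-3.0, -0.05],   # predicts positive
                     [-0.1, -2.4],    # predicts negative
                     [-0.7, -0.7],    # tie, counts as negative
                     [-2.0, -0.14]])  # predicts positive
y = numpy.array([1, 0, 1, 0])
# A weight of 0 drops the last example from the calculation.
y_weights = numpy.array([1.0, 1.0, 1.0, 0.0])

accuracy, num_correct, total = weighted_accuracy(preds, y, y_weights)
print(accuracy, num_correct, total)
```

The first two predictions are right, the tie is wrong and the last (also wrong) prediction has no weight, so the weighted accuracy is 2 correct out of a total weight of 3.
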

Get one batch.

tmp_val_generator = valid_generator(batch_size=64)
tmp_batch = next(tmp_val_generator)

Position 0 has the model inputs (tweets as tensors), position 1 has the targets (the actual labels) and position 2 has the example weights.

tmp_inputs, tmp_targets, tmp_example_weights = tmp_batch

Feed the tweet tensors into the model to get a prediction.

tmp_pred = training_loop.eval_model(tmp_inputs)
tmp_acc, tmp_num_correct, tmp_num_predictions = compute_accuracy(preds=tmp_pred, y=tmp_targets, y_weights=tmp_example_weights)

print(f"Model's prediction accuracy on a single validation batch is: {100 * tmp_acc}%")
print(f"Weighted number of correct predictions {tmp_num_correct}; weighted number of total observations predicted {tmp_num_predictions}")
Model's prediction accuracy on a single validation batch is: 100.0%
Weighted number of correct predictions 64.0; weighted number of total observations predicted 64


Now that we have a trained model, in the next post we'll test how well it did.

Sentiment Analysis: Defining the Model


This continues a series on sentiment analysis with deep learning. In the previous post we loaded and processed our data set. In this post we'll see about actually defining the Neural Network.

In this part we will write our own library of layers. It will be very similar to the one used in Trax and also in Keras and PyTorch. The intention is that writing our own small framework will help us understand how they all work and use them more effectively in the future.


# from pypi
from expects import be_true, expect
from trax import fastmath

import attr
import numpy
import trax
import trax.layers as trax_layers

# this project
from neurotic.nlp.twitter.tensor_generator import TensorBuilder

Set Up

Some aliases to get closer to what the notebook has.

numpy_fastmath = fastmath.numpy
random = fastmath.random


The Base Layer Class

This will be the base class that the others will inherit from.

class Layer:
    """Base class for layers"""
    def forward(self, x: numpy.ndarray):
        """The forward propagation method

        Raises:
         NotImplementedError - method is called but child hasn't implemented it
        """
        raise NotImplementedError

    def init_weights_and_state(self, input_signature, random_key):
        """method to initialize the weights
        based on the input signature and random key,
        to be implemented by subclasses of this Layer class
        """
        raise NotImplementedError

    def init(self, input_signature, random_key) -> numpy.ndarray:
        """initializes and returns the weights

        Note:
         This is just an alias for the ``init_weights_and_state``
         method for some reason

        Args:
         input_signature: who knows?
         random_key: once again, who knows?

        Returns:
         the weights
        """
        self.init_weights_and_state(input_signature, random_key)
        return self.weights

    def __call__(self, x) -> numpy.ndarray:
        """This is an alias for the ``forward`` method

        Args:
         x: input array

        Returns:
         whatever the ``forward`` method does
        """
        return self.forward(x)

The ReLU class

Here's the ReLU function:

\[ \mathrm{ReLU}(x) = \mathrm{max}(0,x) \]

We'll implement the ReLU activation function below. The function will take in a matrix or vector and transform all the negative numbers into 0 while keeping all the positive numbers intact.

Please use numpy.maximum(A,k) to find the maximum between each element in A and a scalar k.
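
It's easy to mix up numpy.maximum and numpy.max, so here is a quick sketch of the difference using plain numpy and a made-up array. The first works element by element, which is what ReLU needs, while the second reduces the whole array to one value.

```python
import numpy

x = numpy.array([[-3.0, 0.5], [2.0, -0.25]])

# numpy.maximum compares element by element and keeps the shape of x.
elementwise = numpy.maximum(x, 0)
print(elementwise)

# numpy.max reduces the whole array to its single largest value.
largest = numpy.max(x)
print(largest)
```
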

class Relu(Layer):
    """Relu activation function implementation"""
    def forward(self, x: numpy.ndarray) -> numpy.ndarray:
        """Performs the activation

        Args:
           - x: the input

        Returns:
           - activation: all positive or 0 version of x
        """
        return numpy.maximum(x, 0)

Test It

x = numpy.array([[-2.0, -1.0, 0.0], [0.0, 1.0, 2.0]], dtype=float)
relu_layer = Relu()
print("Test data is:")
print(x)
print("\nOutput of Relu is:")
actual = relu_layer(x)
print(actual)

expected = numpy.array([[0., 0., 0.],
                        [0., 1., 2.]])

expect(numpy.allclose(actual, expected)).to(be_true)
Test data is:
[[-2. -1.  0.]
 [ 0.  1.  2.]]

Output of Relu is:
[[0. 0. 0.]
 [0. 1. 2.]]

The Dense class

Implement the forward function of the Dense class.

  • The forward function multiplies the input to the layer (x) by the weight matrix (W).

\[ \mathrm{forward}(\mathbf{x},\mathbf{W}) = \mathbf{xW} \]

  • You can use numpy.dot to perform the matrix multiplication.
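
As a rough sketch of how the shapes work out in the forward pass, here is the multiplication with plain numpy and made-up values (numpy.dot is one way to do the matrix multiplication, and the sizes below are arbitrary). The number of rows in the weight matrix has to match the last dimension of x, and the number of columns is the number of units.

```python
import numpy

n_units = 10

# Hypothetical input: 2 examples with 3 features each.
x = numpy.ones((2, 3))

# Rows match the last dimension of x; columns match the number of units.
weights = numpy.full((x.shape[-1], n_units), 0.5)

output = numpy.dot(x, weights)
print(output.shape)

# The same weights work when x has a leading batch dimension,
# since only the last dimension of x is multiplied away.
batch = numpy.ones((4, 2, 3))
batched_output = numpy.dot(batch, weights)
print(batched_output.shape)
```
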

Note that for more efficient code execution, you will use the trax version of math, which includes a trax version of numpy and also random.

Implement the weight initializer new_weights function

  • Weights are initialized with a random key.
  • The second parameter is a tuple for the desired shape of the weights (num_rows, num_cols)
  • The num of rows for weights should equal the number of columns in x, because for forward propagation, you will multiply x times weights.

Please use trax.fastmath.random.normal(key, shape, dtype=tf.float32) to generate random values for the weight matrix. The key difference between this function and the standard numpy randomness is the explicit use of random keys, which need to be passed in. While it can look tedious at first sight to pass the random key everywhere, you will learn in Course 4 why this is very helpful when implementing some advanced models.

  • key can be generated by calling random.get_prng(seed) and passing in a number for the seed.
  • shape is a tuple with the desired shape of the weight matrix.
    • The number of rows in the weight matrix should equal the number of columns in the variable x. Since x may have 2 dimensions if it represents a single training example (row, col), or three dimensions (batch_size, row, col), get the last dimension from the tuple that holds the dimensions of x.
    • The number of columns in the weight matrix is the number of units chosen for that dense layer. Look at the __init__ function to see which variable stores the number of units.
  • dtype is the data type of the values in the generated matrix; keep the default of tf.float32. In this case, don't explicitly set the dtype (just let it use the default value).

Set the standard deviation of the random values to 0.1

  • The values generated have a mean of 0 and standard deviation of 1.
  • Set the default standard deviation stdev to be 0.1 by multiplying the standard deviation to each of the values in the weight matrix.
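
The shape and scaling rules above can be sketched with plain NumPy. Note that this swaps in numpy.random.default_rng as a stand-in for the trax random key, so it only illustrates the idea and is not the trax call itself. The input shape and the number of units are made up.

```python
import numpy

# stand-in for the trax random key: a seeded NumPy generator (an assumption)
generator = numpy.random.default_rng(seed=0)

# hypothetical batched input: (batch_size, rows, columns)
x = numpy.zeros((4, 5, 3))
n_units = 10
init_stdev = 0.1

# the weight rows come from the LAST dimension of x,
# so this works for 2-D and 3-D inputs alike
shape = (x.shape[-1], n_units)

# draws with mean 0 and standard deviation 1, scaled down to stdev 0.1
weights = generator.standard_normal(shape) * init_stdev
print(weights.shape)
print(numpy.dot(x, weights).shape)
```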

See how the trax.fastmath.random.normal function works.

tmp_key = random.get_prng(seed=1)
print("The random seed generated by random.get_prng")
print(repr(tmp_key))
The random seed generated by random.get_prng
DeviceArray([0, 1], dtype=uint32)

For some reason tensorflow can't find the GPU. Setting the log level to 0 like the message suggests shows that it gives up after trying to find a TPU, there's no indication that it's looking for the GPU.

import tensorflow

Hmmm. I'll have to troubleshoot that.

tmp_shape = (2, 3)
print("choose a matrix with 2 rows and 3 columns")
print(tmp_shape)
choose a matrix with 2 rows and 3 columns
(2, 3)

Generate a weight matrix. Note that you'll get an error if you try to set dtype to tf.float32, where tf is tensorflow. Just avoid setting the dtype and allow it to use the default data type.

tmp_weight = random.normal(key=tmp_key, shape=tmp_shape)

print("Weight matrix generated with a normal distribution with mean 0 and stdev of 1")
print(repr(tmp_weight))
Weight matrix generated with a normal distribution with mean 0 and stdev of 1
DeviceArray([[ 0.957307  , -0.9699291 ,  1.0070664 ],
             [ 0.36619022,  0.17294823,  0.29092228]], dtype=float32)
class Dense(Layer):
    """
    A dense (fully-connected) layer.

    Args:
     - n_units: the number of columns for our weight matrix
     - init_stdev: standard deviation for our initial weights
    """
    n_units: int
    init_stdev: float=0.1

    def forward(self, x: numpy.ndarray) -> numpy.ndarray:
        """The dot product of the input and the weights

        Args:
         x: input to multiply

        Returns:
         product of x and weights
        """
        return numpy.dot(x, self.weights)

    def init_weights_and_state(self, input_signature: tuple,
                               random_key: int) -> numpy.ndarray:
        """initializes the weights

        Args:
         input_signature: tuple whose final dimension will be the number of rows
         random_key: something to start the random normal generator with
        """
        input_shape = input_signature.shape

        # to allow for more than two-dimensional matrices,
        # we use the last column of the input shape, rather than assuming it's
        # column 1
        self.weights = (random.normal(key=random_key,
                                      shape=(input_shape[-1], self.n_units))
                        * self.init_stdev)
        return self.weights
dense_layer = Dense(n_units=10)  #sets  number of units in dense layer
random_key = random.get_prng(seed=0)  # sets random seed
z = numpy.array([[2.0, 7.0, 25.0]]) # input array 

dense_layer.init(z, random_key)
print("Weights are\n ",dense_layer.weights) #Returns randomly generated weights
output = dense_layer(z)
print("Foward function output is ", output) # Returns multiplied values of units and weights

expected_weights = numpy.array([
    [-0.02837108,  0.09368162, -0.10050076,  0.14165013,  0.10543301,  0.09108126,
     -0.04265672,  0.0986188,  -0.05575325,  0.00153249],
    [-0.20785688,  0.0554837,   0.09142365,  0.05744595,  0.07227863,  0.01210617,
     -0.03237354,  0.16234995,  0.02450038, -0.13809784],
    [-0.06111237,  0.01403724,  0.08410042, -0.1094358,  -0.10775021, -0.11396459,
     -0.05933381, -0.01557652, -0.03832145, -0.11144515]])

expected_output = numpy.array(
    [[-3.0395496,   0.9266802,   2.5414743,  -2.050473,   -1.9769388,  -2.582209,
      -1.7952735,   0.94427425, -0.8980402,  -3.7497487]])

expect(numpy.allclose(dense_layer.weights, expected_weights)).to(be_true)
expect(numpy.allclose(output, expected_output)).to(be_true)
Weights are
  [[-0.02837108  0.09368162 -0.10050076  0.14165013  0.10543301  0.09108126
  -0.04265672  0.0986188  -0.05575325  0.00153249]
 [-0.20785688  0.0554837   0.09142365  0.05744595  0.07227863  0.01210617
  -0.03237354  0.16234995  0.02450038 -0.13809784]
 [-0.06111237  0.01403724  0.08410042 -0.1094358  -0.10775021 -0.11396459
  -0.05933381 -0.01557652 -0.03832145 -0.11144515]]
Foward function output is  [[-3.03954965  0.92668021  2.54147445 -2.05047299 -1.97693891 -2.58220917
  -1.79527355  0.94427423 -0.89804017 -3.74974866]]

The Layers for the Trax-Based Model

For the model implementation we will use the Trax layers library. Trax layers are very similar to the ones we implemented above, but in addition to trainable weights they also have a non-trainable state. This state is used in layers like batch normalization and for inference - we will learn more about it later on.


First, look at the code of the Trax Dense layer and compare to the implementation above.

Another important layer that we will use a lot is the Serial layer, which allows us to execute one layer after another in sequence.

  • You can pass in the layers as arguments to Serial, separated by commas.
  • For example: tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))
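
As a rough analogy for what "one layer after another" means, here is a pure-Python sketch of serial composition. This is not how Trax implements Serial. The serial helper and the toy "layers" are hypothetical names used only for illustration.

```python
from functools import reduce

def serial(*layers):
    """Compose callables so each one's output feeds the next (hypothetical helper)"""
    def run(x):
        # start with x and pass the running value through each layer in order
        return reduce(lambda value, layer: layer(value), layers, x)
    return run

# toy stand-ins for layers
def double(x):
    return x * 2

def add_one(x):
    return x + 1

model = serial(double, add_one)
print(model(3))
```

The order of the arguments is the order of execution, which is the same convention as passing layers to Serial separated by commas.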

The layer classes have pretty good docstrings, unlike the fastmath stuff, so it might be useful to look at it - but it's too long to include here.

We're also going to use an Embedding layer.

  • tl.Embedding(vocab_size, d_feature).
  • vocab_size is the number of unique words in the given vocabulary.
  • d_feature is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
tmp_embed = trax_layers.Embedding(vocab_size=3, d_feature=2)

Another useful layer is the Mean which calculates means across an axis. In this case, use axis = 1 (across rows) to get an average embedding vector (an embedding vector that is an average of all words in the vocabulary).

  • For example, if the embedding matrix is 300 elements and vocab size is 10,000 words, taking the mean of the embedding matrix along axis=1 will yield a vector of 300 elements.

Pretend the embedding matrix uses 2 elements for embedding the meaning of a word and has a vocabulary size of 3, so it has shape (2,3).

tmp_embed = numpy.array([[1, 2, 3],
                         [4, 5, 6]])

First take the mean along axis 0, which creates a vector whose length equals the vocabulary size (the number of columns).

array([2.5, 3.5, 4.5])

If you take the mean along axis 1 it creates a vector whose length equals the number of elements in a word embedding (the rows).

array([2., 5.])
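
The two reductions can be reproduced with the following sketch. The second row of the matrix is an assumption inferred from the means shown above (it is the only row that produces them given a first row of 1, 2, 3).

```python
import numpy

# 2 embedding elements (rows) by 3 vocabulary entries (columns);
# the second row is inferred from the means shown in the text
tmp_embed = numpy.array([[1, 2, 3],
                         [4, 5, 6]])

# collapse the rows: one value per column (vocabulary size)
mean_axis_0 = numpy.mean(tmp_embed, axis=0)

# collapse the columns: one value per row (embedding size)
mean_axis_1 = numpy.mean(tmp_embed, axis=1)

print(mean_axis_0)
print(mean_axis_1)
```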

Finally, a LogSoftmax layer gives you a log-softmax output.
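
For intuition about what a log-softmax output looks like, here is a minimal NumPy sketch. It is not the Trax layer, just the standard formula (each score minus the log of the summed exponentials), and the scores are made up.

```python
import numpy

def log_softmax(x: numpy.ndarray) -> numpy.ndarray:
    """Log of the softmax along the last axis (illustrative, not the Trax layer)"""
    # subtracting the max improves numerical stability without changing the result
    shifted = x - numpy.max(x, axis=-1, keepdims=True)
    log_total = numpy.log(numpy.sum(numpy.exp(shifted), axis=-1, keepdims=True))
    return shifted - log_total

# made-up scores for a two-class output
scores = numpy.array([[2.0, 0.0]])
log_probabilities = log_softmax(scores)

# exponentiating the outputs recovers probabilities that sum to one
print(numpy.exp(log_probabilities).sum(axis=-1))
```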

Online Documentation

For completeness, here are some links to the Read the Docs documentation for these layers.

The Classifier Function

builder = TensorBuilder()
size_of_vocabulary = len(builder.vocabulary)
def classifier(vocab_size: int=size_of_vocabulary,
               embedding_dim: int=256,
               output_dim: int=2) -> trax_layers.Serial:
    """Creates the classifier model

    Args:
     vocab_size: number of tokens in the training vocabulary
     embedding_dim: output dimension for the Embedding layer
     output_dim: dimension for the Dense layer

    Returns:
     the composed layer-model
    """
    embed_layer = trax_layers.Embedding(
        vocab_size=vocab_size, # Size of the vocabulary
        d_feature=embedding_dim)  # Embedding dimension

    mean_layer = trax_layers.Mean(axis=1)

    dense_output_layer = trax_layers.Dense(n_units = output_dim)

    log_softmax_layer = trax_layers.LogSoftmax()

    # stack the layers in the order the data flows through them
    model = trax_layers.Serial(
        embed_layer,
        mean_layer,
        dense_output_layer,
        log_softmax_layer
    )
    return model
tmp_model = classifier()
print(type(tmp_model))
<class 'trax.layers.combinators.Serial'>


Now that we have our Deep Learning model, we'll move on to training it.

Sentiment Analysis: Pre-processing the Data


This is the next in a series about building a Deep Learning model for sentiment analysis. The first post was this one.


# from python
from argparse import Namespace

import random

# from pypi
from expects import contain_exactly, equal, expect
from nltk.corpus import twitter_samples

import nltk
import numpy

# this project
from neurotic.nlp.twitter.processor import TwitterProcessor

Set Up

The NLTK data has to be downloaded at least once.

nltk.download("twitter_samples", download_dir="~/data/datasets/nltk_data/")


The NLTK Data

positive = twitter_samples.strings('positive_tweets.json')
negative = twitter_samples.strings('negative_tweets.json')

print(f"Positive Tweets: {len(positive):,}")
print(f"Negative Tweets: {len(negative):,}")
Positive Tweets: 5,000
Negative Tweets: 5,000

Split It Up

Instead of randomly splitting the data we're going to do a straight slice.

SPLIT = 4000

Split positive set into validation and training

positive_validation   = positive[SPLIT:]
positive_training  = positive[:SPLIT]

Split negative set into validation and training

negative_validation = negative[SPLIT:]
negative_training  = negative[:SPLIT]

Combine the Data Sets

The X data.

train_x = positive_training + negative_training
validation_x = positive_validation + negative_validation

The labels (1 for positive, 0 for negative).

train_y = numpy.append(numpy.ones(len(positive_training)),
                       numpy.zeros(len(negative_training)))
validation_y = numpy.append(numpy.ones(len(positive_validation)),
                            numpy.zeros(len(negative_validation)))

print(f"length of train_x {len(train_x):,}")
print(f"length of validation_x {len(validation_x):,}")
length of train_x 8,000
length of validation_x 2,000

Building the vocabulary

Now build the vocabulary.

  • Map each word in each tweet to an integer (an "index").
  • The following code does this for you, but please read it and understand what it's doing.
  • Note that you will build the vocabulary based on the training data.
  • To do so, you will assign an index to every word by iterating over your training set.

The vocabulary will also include some special tokens

  • __PAD__: padding
  • __</e>__: end of line
  • __UNK__: a token representing any word that is not in the vocabulary.
Tokens = Namespace(padding="__PAD__", ending="__</e>__", unknown="__UNK__")
process = TwitterProcessor()
vocabulary = {Tokens.padding: 0, Tokens.ending: 1, Tokens.unknown: 2}
for tweet in train_x:
    for token in process(tweet):
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
print(f"Words in the vocabulary: {len(vocabulary):,}")

count = 0
for token in vocabulary:
    print(f"{count}: {token}: {vocabulary[token]}")
    count += 1
    if count == 5:
        break
Words in the vocabulary: 9,164
0: __PAD__: 0
1: __</e>__: 1
2: __UNK__: 2
3: followfriday: 3
4: top: 4

Converting a tweet to a tensor

Now we'll write a function that will convert each tweet to a tensor (a list of unique integer IDs representing the processed tweet).

  • Note, the returned data type will be a regular Python `list()`
    • You won't use TensorFlow in this function
    • You also won't use a numpy array
    • You also won't use trax.fastmath.numpy array
  • For words in the tweet that are not in the vocabulary, set them to the unique ID for the token `__UNK__`.

    For example, given this string:

'@happypuppy, is Maria happy?'

You first tokenize it.

['maria', 'happi']

Then convert each word into the index for it.

[2, 56]

Notice that the word "maria" is not in the vocabulary, so it is assigned the unique integer associated with the __UNK__ token, because it is considered "unknown."

def tweet_to_tensor(tweet: str, vocab_dict: dict,
                    unk_token: str='__UNK__', verbose: bool=False):
    """Convert a tweet to a list of indices

    Args:
       tweet - A string containing a tweet
       vocab_dict - The words dictionary
       unk_token - The special string for unknown tokens
       verbose - Print info during runtime

    Returns:
       tensor_l - A python list with indices for the tweet tokens
    """
    # Process the tweet into a list of words
    # where only important words are kept (stop words removed)
    # (process is the TwitterProcessor instance created above)
    word_l = process(tweet)

    if verbose:
        print("List of words from the processed tweet:")
        print(word_l)

    # Initialize the list that will contain the unique integer IDs of each word
    tensor_l = []

    # Get the unique integer ID of the __UNK__ token
    unk_ID = vocab_dict[unk_token]

    if verbose:
        print(f"The unique integer ID for the unk_token is {unk_ID}")

    # for each word in the list:
    for word in word_l:

        # Get the unique integer ID.
        # If the word doesn't exist in the vocab dictionary,
        # use the unique ID for __UNK__ instead.
        word_ID = vocab_dict.get(word, unk_ID)

        # Append the unique integer ID to the tensor list.
        tensor_l.append(word_ID)

    return tensor_l
print("Actual tweet is\n", positive_validation[0])
print("\nTensor of tweet:\n", tweet_to_tensor(positive_validation[0], vocab_dict=vocabulary))
Actual tweet is
 Bro:U wan cut hair anot,ur hair long Liao bo
Me:since ord liao,take it easy lor treat as save $ leave it longer :)
Bro:LOL Sibei xialan

Tensor of tweet:
 [1072, 96, 484, 2376, 750, 8220, 1132, 750, 53, 2, 2701, 796, 2, 2, 354, 606, 2, 3523, 1025, 602, 4599, 9, 1072, 158, 2, 2]
def test_tweet_to_tensor():
    test_cases = [
        {
            "name": "simple_test_check",
            "input": [positive_validation[1], vocabulary],
            "expected":[444, 2, 304, 567, 56, 9],
            "error":"The function gives bad output for val_pos[1]. Test failed"
        },
        {
            "name": "datatype_check",
            "input":[positive_validation[1], vocabulary],
            "expected": list,
            "error":"Datatype mismatch. Need only list not np.array"
        },
        {
            "name": "without_unk_check",
            "input":[positive_validation[1], vocabulary],
            "error":"Unk word check not done- Please check if you included mapping for unknown word"
        },
    ]
    count = 0
    for test_case in test_cases:
        try:
            if test_case['name'] == "simple_test_check":
                assert test_case["expected"] == tweet_to_tensor(*test_case['input'])
                count += 1
            if test_case['name'] == "datatype_check":
                assert isinstance(tweet_to_tensor(*test_case['input']), test_case["expected"])
                count += 1
            if test_case['name'] == "without_unk_check":
                assert None not in tweet_to_tensor(*test_case['input'])
                count += 1
        except AssertionError:
            # report which check failed and keep going
            print(test_case['error'])

    if count == 3:
        print("\033[92m All tests passed")
    else:
        print(count," Tests passed out of 3")

test_tweet_to_tensor()
The function gives bad output for val_pos[1]. Test failed
2  Tests passed out of 3

Their tweet processor wipes out everything after the start of a URL, even if it isn't part of the URL, so they end up with fewer tokens and the indices won't match exactly.

Creating a batch generator

Most of the time in Natural Language Processing, and AI in general, we use batches when training on our data sets.

  • If instead of training with batches of examples, you were to train a model with one example at a time, it would take a very long time to train the model.
  • You will now build a data generator that takes in the positive/negative tweets and returns a batch of training examples. It returns the model inputs, the targets (positive or negative labels) and the weight for each target (ex: this allows us to treat some examples as more important to get right than others, but commonly this will all be 1.0).

Once you create the generator, you could include it in a for loop:

for batch_inputs, batch_targets, batch_example_weights in data_generator:

You can also get a single batch like this:

batch_inputs, batch_targets, batch_example_weights = next(data_generator)

The generator returns the next batch each time it's called.

  • This generator returns the data in a format (tensors) that you could directly use in your model.
  • It returns a triple: the inputs, targets, and loss weights:

    • Inputs is a tensor that contains the batch of tweets we put into the model.
    • Targets is the corresponding batch of labels that we train to generate.
    • Loss weights here are just 1s with the same shape as targets. Next week, you will use them to mask input padding.
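
The way a generator hands out (inputs, targets, weights) triples can be sketched with a toy stand-in. The function toy_generator and its made-up data are hypothetical and only mimic the shape of the real output. islice is used here to take a bounded number of batches from an endless generator.

```python
from itertools import islice

def toy_generator(batch_size: int):
    """Yields made-up (inputs, targets, weights) triples forever"""
    batch_number = 0
    while True:
        half = batch_size // 2
        # fake "tweets": each row just records the batch and its position
        inputs = [[batch_number, index] for index in range(batch_size)]
        # first half labeled positive (1), second half negative (0)
        targets = [1] * half + [0] * (batch_size - half)
        # every example counts equally
        weights = [1] * batch_size
        yield inputs, targets, weights
        batch_number += 1

generator = toy_generator(batch_size=4)

# a single batch
inputs, targets, weights = next(generator)
print(targets)

# a bounded number of batches from an endless generator
for inputs, targets, weights in islice(generator, 2):
    print(inputs[0])
```

Each call to next picks up where the previous one stopped, so the for loop above continues with the second batch rather than starting over.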


A batch of spaghetti.

# GRADED: Data generator
def data_generator(data_pos: list, data_neg: list, batch_size: int,
                   loop: bool, vocab_dict: dict, shuffle: bool=False):
    """Generates batches of data

    Input:
       data_pos - Set of positive examples
       data_neg - Set of negative examples
       batch_size - number of samples per batch. Must be even
       loop - True or False
       vocab_dict - The words dictionary
       shuffle - Shuffle the data order

    Yield:
       inputs - Subset of positive and negative examples
       targets - The corresponding labels for the subset
       example_weights - An array specifying the importance of each example
    """
    # make sure the batch size is an even number
    # to allow an equal number of positive and negative samples
    assert batch_size % 2 == 0

    # Number of positive examples in each batch is half of the batch size
    # same with number of negative examples in each batch
    n_to_take = batch_size // 2

    # Use pos_index to walk through the data_pos array
    # same with neg_index and data_neg
    pos_index = 0
    neg_index = 0

    len_data_pos = len(data_pos)
    len_data_neg = len(data_neg)

    # Get an array with the data indexes
    pos_index_lines = list(range(len_data_pos))
    neg_index_lines = list(range(len_data_neg))

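    # shuffle lines if shuffle is set to True
    # (rnd is the module-level random number generator set below)
    if shuffle:
        rnd.shuffle(pos_index_lines)
        rnd.shuffle(neg_index_lines)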

    stop = False

    # Loop indefinitely
    while not stop:  

        # create a batch with positive and negative examples
        batch = []

        # First part: Pack n_to_take positive examples

        # Start from pos_index and increment i up to n_to_take
        for i in range(n_to_take):

            # If the positive index goes past the positive dataset length,
            if pos_index >= len_data_pos: 

                # If loop is set to False, break once we reach the end of the dataset
                if not loop:
                    stop = True
                    break

                # If user wants to keep re-using the data, reset the index
                pos_index = 0

                if shuffle:
                    # Shuffle the index of the positive sample
                    rnd.shuffle(pos_index_lines)

            # get the tweet as pos_index
            tweet = data_pos[pos_index_lines[pos_index]]

            # convert the tweet into tensors of integers representing the processed words
            tensor = tweet_to_tensor(tweet, vocab_dict)

            # append the tensor to the batch list
            batch.append(tensor)

            # Increment pos_index by one
            pos_index = pos_index + 1

        # Second part: Pack n_to_take negative examples

        # Using the same batch list, take the next n_to_take negative examples
        for i in range(n_to_take):

            # If the negative index goes past the negative dataset length,
            if neg_index >= len_data_neg:

                # If loop is set to False, break once we reach the end of the dataset
                if not loop:
                    stop = True
                    break

                # If user wants to keep re-using the data, reset the index
                neg_index = 0

                if shuffle:
                    # Shuffle the index of the negative sample
                    rnd.shuffle(neg_index_lines)
            # get the tweet at neg_index
            tweet = data_neg[neg_index_lines[neg_index]]

            # convert the tweet into tensors of integers representing the processed words
            tensor = tweet_to_tensor(tweet, vocab_dict)

            # append the tensor to the batch list
            batch.append(tensor)

            # Increment neg_index by one
            neg_index += 1

        if stop:
            break

        # pos_index and neg_index were already advanced by one per example
        # inside the loops above, so they now point at the start of the next
        # batch; adding n_to_take again here would skip examples

        # Get the max tweet length (the length of the longest tweet) 
        # (you will pad all shorter tweets to have this length)
        max_len = max([len(t) for t in batch]) 

        # Initialize the input_l, which will 
        # store the padded versions of the tensors
        tensor_pad_l = []
        # Pad shorter tweets with zeros
        for tensor in batch:
            # Get the number of positions to pad for this tensor so that it will be max_len long
            n_pad = max_len - len(tensor)

            # Generate a list of zeros, with length n_pad
            pad_l = [0] * n_pad

            # concatenate the tensor and the list of padded zeros
            tensor_pad = tensor + pad_l

            # append the padded tensor to the list of padded tensors
            tensor_pad_l.append(tensor_pad)

        # convert the list of padded tensors to a numpy array
        # and store this as the model inputs
        inputs = numpy.array(tensor_pad_l)

        # Generate the list of targets for the positive examples (a list of ones)
        # The length is the number of positive examples in the batch
        target_pos = [1] * len(batch[:n_to_take])

        # Generate the list of targets for the negative examples (a list of zeros)
        # The length is the number of negative examples in the batch
        target_neg = [0] * len(batch[n_to_take:])

        # Concatenate the positive and negative targets
        target_l = target_pos + target_neg

        # Convert the target list into a numpy array
        targets = numpy.array(target_l)

        # Example weights: treat all examples as equally important (an array of ones)
        example_weights = numpy.ones_like(targets)

        yield inputs, targets, example_weights

Now you can use your data generator to create a data generator for the training data, and another data generator for the validation data.

We will create a third data generator that does not loop, for testing the final accuracy of the model.

# Set the random number generator for the shuffle procedure
rnd = random

# Create the training data generator
def train_generator(batch_size, shuffle = False):
    return data_generator(positive_training, negative_training,
                          batch_size, True, vocabulary, shuffle)

# Create the validation data generator
def val_generator(batch_size, shuffle = False):
    return data_generator(positive_validation, negative_validation,
                          batch_size, True, vocabulary, shuffle)

# Create the test data generator (does not loop)
def test_generator(batch_size, shuffle = False):
    return data_generator(positive_validation, negative_validation, batch_size,
                          False, vocabulary, shuffle)

# Get a batch from the train_generator and inspect.
inputs, targets, example_weights = next(train_generator(4, shuffle=True))
# this will print a list of 4 tensors padded with zeros
print(f'Inputs: {inputs}')
print(f'Targets: {targets}')
print(f'Example Weights: {example_weights}')
Inputs: [[2030 4492 3231    9    0    0    0    0    0    0    0]
 [5009  571 2025 1475 5233 3532  142 3532  132  464    9]
 [3798  111   96  587 2960 4007    0    0    0    0    0]
 [ 256 3798    0    0    0    0    0    0    0    0    0]]
Targets: [1 1 0 0]
Example Weights: [1 1 1 1]
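
The zero-padding visible in the inputs above can be sketched on its own. The helper pad_batch is a hypothetical name and the token IDs are made up. The real generator does the same thing inline before converting to a numpy array.

```python
def pad_batch(batch: list, padding_id: int = 0) -> list:
    """Right-pads every list in the batch to the length of the longest one"""
    longest = max(len(tensor) for tensor in batch)
    return [tensor + [padding_id] * (longest - len(tensor)) for tensor in batch]

# made-up token IDs for three "tweets" of different lengths
batch = [[5, 6, 7], [8], [9, 10]]
padded = pad_batch(batch)
print(padded)
```

Padding per batch rather than to a global maximum keeps the arrays only as wide as the longest tweet in that batch, which is why the input widths differ from batch to batch.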

Test the train_generator

tmp_data_gen = train_generator(batch_size = 4)

Call the data generator to get one batch and its targets.

tmp_inputs, tmp_targets, tmp_example_weights = next(tmp_data_gen)
print(f"The inputs shape is {tmp_inputs.shape}")
print(f"The targets shape is {tmp_targets.shape}")
print(f"The example weights shape is {tmp_example_weights.shape}")

for i,t in enumerate(tmp_inputs):
    print(f"input tensor: {t}; target {tmp_targets[i]}; example weights {tmp_example_weights[i]}")
The inputs shape is (4, 14)
The targets shape is (4,)
The example weights shape is (4,)
input tensor: [3 4 5 6 7 8 9 0 0 0 0 0 0 0]; target 1; example weights 1
input tensor: [10 11 12 13 14 15 16 17 18 19 20  9 21 22]; target 1; example weights 1
input tensor: [5807 2931 3798    0    0    0    0    0    0    0    0    0    0    0]; target 0; example weights 1
input tensor: [ 865  261 3689 5808  313 4499  571 1248 2795  333 1220 3798    0    0]; target 0; example weights 1

Bundle It Up


# python
from argparse import Namespace
from itertools import cycle

import random

# pypi
from nltk.corpus import twitter_samples

import attr
import numpy

# this project
from .processor import TwitterProcessor


Defaults = Namespace(
    split = 4000,
)

NLTK Settings

NLTK = Namespace(
    positive = "positive_tweets.json",
    negative = "negative_tweets.json",
)

Special Tokens

SpecialTokens = Namespace(padding="__PAD__",
                          ending="__</e>__",
                          unknown="__UNK__")

SpecialIDs = Namespace(
    padding=0,
    ending=1,
    unknown=2,
)

The Builder

@attr.s(auto_attribs=True)
class TensorBuilder:
    """converts tweets to tensors

    Args:
     - split: where to split the training and validation data
    """
    split: int=Defaults.split
    _positive: list=None
    _negative: list=None
    _positive_training: list=None
    _negative_training: list=None
    _positive_validation: list=None
    _negative_validation: list=None
    _process: TwitterProcessor=None
    _vocabulary: dict=None
    _x_train: list=None
  • Positive Tweets
    @property
    def positive(self) -> list:
        """The raw positive NLTK tweets"""
        if self._positive is None:
            self._positive = twitter_samples.strings(NLTK.positive)
        return self._positive
  • Negative Tweets
    @property
    def negative(self) -> list:
        """The raw negative NLTK tweets"""
        if self._negative is None:
            self._negative = twitter_samples.strings(NLTK.negative)
        return self._negative
  • Positive Training
    @property
    def positive_training(self) -> list:
        """The positive training data"""
        if self._positive_training is None:
            self._positive_training = self.positive[:self.split]
        return self._positive_training
  • Negative Training
    @property
    def negative_training(self) -> list:
        """The negative training data"""
        if self._negative_training is None:
            self._negative_training = self.negative[:self.split]
        return self._negative_training
  • Positive Validation
    @property
    def positive_validation(self) -> list:
        """The positive validation data"""
        if self._positive_validation is None:
            self._positive_validation = self.positive[self.split:]
        return self._positive_validation
  • Negative Validation
    @property
    def negative_validation(self) -> list:
        """The negative validation data"""
        if self._negative_validation is None:
            self._negative_validation = self.negative[self.split:]
        return self._negative_validation
  • Twitter Processor
    @property
    def process(self) -> TwitterProcessor:
        """processor for tweets"""
        if self._process is None:
            self._process = TwitterProcessor()
        return self._process
  • X Train
    @property
    def x_train(self) -> list:
        """The unprocessed training data"""
        if self._x_train is None:
            self._x_train = self.positive_training + self.negative_training
        return self._x_train
  • The Vocabulary
    @property
    def vocabulary(self) -> dict:
        """A map of token to numeric id"""
        if self._vocabulary is None:
            self._vocabulary = {SpecialTokens.padding: SpecialIDs.padding,
                                SpecialTokens.ending: SpecialIDs.ending,
                                SpecialTokens.unknown: SpecialIDs.unknown}
            for tweet in self.x_train:
                for token in self.process(tweet):
                    if token not in self._vocabulary:
                        self._vocabulary[token] = len(self._vocabulary)
        return self._vocabulary
  • To Tensor
    def to_tensor(self, tweet: str) -> list:
        """Converts tweet to list of numeric identifiers

        Args:
         tweet: the string to convert

        Returns:
         list of IDs for the tweet
        """
        tensor = [self.vocabulary.get(token, SpecialIDs.unknown)
                  for token in self.process(tweet)]
        return tensor
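
The builder's attributes follow a compute-once pattern: each value starts as None and is built on first access. A minimal standalone sketch of that pattern, using a plain class instead of attrs and a hypothetical LazySquares example, might look like this.

```python
class LazySquares:
    """Hypothetical class illustrating compute-once properties"""
    def __init__(self, limit: int):
        self.limit = limit
        self._squares = None
        # counts how many times the expensive step actually runs
        self.builds = 0

    @property
    def squares(self) -> list:
        """The squares, built on first access and then cached"""
        if self._squares is None:
            self.builds += 1
            self._squares = [value**2 for value in range(self.limit)]
        return self._squares

lazy = LazySquares(limit=4)
print(lazy.squares)
print(lazy.squares is lazy.squares)
print(lazy.builds)
```

For the builder this means the vocabulary is only constructed the first time something asks for it, and setting the private attribute back to None forces a rebuild.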

The Generator

@attr.s(auto_attribs=True)
class TensorGenerator:
    """Generates batches of vectorized-tweets

    Args:
     converter: TensorBuilder object
     positive_data: list of positive data
     negative_data: list of negative data
     batch_size: the size for each generated batch
     shuffle: whether to shuffle the generated data
     infinite: whether to generate data forever
    """
    converter: TensorBuilder
    positive_data: list
    negative_data: list
    batch_size: int
    shuffle: bool=True
    infinite: bool = True
    _positive_indices: list=None
    _negative_indices: list=None
    _positives: iter=None
    _negatives: iter=None
  • Positive Indices
    @property
    def positive_indices(self) -> list:
        """The indices to use to grab the positive tweets"""
        if self._positive_indices is None:
            k = len(self.positive_data)
            if self.shuffle:
                self._positive_indices = random.sample(range(k), k=k)
            else:
                self._positive_indices = list(range(k))
        return self._positive_indices
  • Negative Indices
    @property
    def negative_indices(self) -> list:
        """Indices for the negative tweets"""
        if self._negative_indices is None:
            k = len(self.negative_data)
            if self.shuffle:
                self._negative_indices = random.sample(range(k), k=k)
            else:
                self._negative_indices = list(range(k))
        return self._negative_indices
  • Positives
    @property
    def positives(self):
        """The positive index generator"""
        if self._positives is None:
            self._positives = self.positive_generator()
        return self._positives
  • Negatives
    @property
    def negatives(self):
        """The negative index generator"""
        if self._negatives is None:
            self._negatives = self.negative_generator()
        return self._negatives
  • Positive Generator
    def positive_generator(self):
        """Generator of indices for positive tweets"""
        stop = len(self.positive_indices)
        index = 0
        while True:
            yield self.positive_indices[index]
            index += 1
            if index == stop:
                if not self.infinite:
                    break
                if self.shuffle:
                    self._positive_indices = None
                index = 0
  • Negative Generator
    def negative_generator(self):
        """generator of indices for negative tweets"""
        stop = len(self.negative_indices)
        index = 0
        while True:
            yield self.negative_indices[index]
            index += 1
            if index == stop:
                if not self.infinite:
                    break
                if self.shuffle:
                    self._negative_indices = None
                index = 0
  • The Iterator
    def __iter__(self):
        return self
  • The Next Method
    def __next__(self):
        assert self.batch_size % 2 == 0
        half_batch = self.batch_size // 2
        # get the indices
        positives = (next(self.positives) for index in range(half_batch))
        negatives = (next(self.negatives) for index in range(half_batch))
        # get the tweets
        positives = (self.positive_data[index] for index in positives)
        negatives = (self.negative_data[index] for index in negatives)
        # get the token ids
        try:
            positives = [self.converter.to_tensor(tweet) for tweet in positives]
            negatives = [self.converter.to_tensor(tweet) for tweet in negatives]
        except RuntimeError:
            # the next(self.positives) in the first generator will raise a
            # RuntimeError if
            # we're not running this infinitely
            raise StopIteration
        batch = positives + negatives
        longest = max((len(tweet) for tweet in batch))
        paddings = (longest - len(tensor) for tensor in batch)
        paddings = ([0] * padding for padding in paddings)
        padded = [tensor + padding for tensor, padding in zip(batch, paddings)]
        inputs = numpy.array(padded)
        # the labels for the inputs
        targets = numpy.array([1] * half_batch + [0] * half_batch)
        assert len(targets) == len(batch)
        # default the weights to ones
        weights = numpy.ones_like(targets)    
        return inputs, targets, weights
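
The except RuntimeError in __next__ relies on a Python detail (PEP 479): when a StopIteration escapes from inside a generator, Python 3.7 and later re-raise it as a RuntimeError. Here is a minimal sketch of that conversion, separate from the class. The finite_indices and take helpers are made up for illustration.

```python
def finite_indices(indices):
    """Yield each index once, then finish (like a non-infinite generator)."""
    for index in indices:
        yield index


def take(source, count):
    """Pull `count` items with next() inside a generator expression."""
    return list(next(source) for _ in range(count))


source = finite_indices([0, 1, 2])
first = take(source, 2)       # two items are available, so this works

try:
    take(source, 2)           # only one item is left
    outcome = "no error"
except RuntimeError as error:
    # the StopIteration from next(source) was raised inside the generator
    # expression, so Python turned it into a RuntimeError
    outcome = f"RuntimeError: {error}"

print(first)
print(outcome)
```

This is why __next__ catches RuntimeError and re-raises StopIteration, which is what a for loop over the iterator expects.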

Test It Out

from neurotic.nlp.twitter.tensor_generator import TensorBuilder, TensorGenerator

converter = TensorBuilder()
tweet = positive_validation[0]
expected = [1072, 96, 484, 2376, 750, 8220, 1132, 750, 53, 2, 2701, 796, 2, 2,
            354, 606, 2, 3523, 1025, 602, 4599, 9, 1072, 158, 2, 2]

actual = converter.to_tensor(tweet)
generator = TensorGenerator(converter, batch_size=4)
print(next(generator))
(array([[ 749, 1019,  313, 1020,   75],
       [1009,    9,    0,    0,    0],
       [3540, 6030, 6031, 3798,    0],
       [  50,   96, 3798,    0,    0]]), array([1, 1, 0, 0]), array([1, 1, 1, 1]))
for count, batch in enumerate(generator):
    print(batch[0])
    print()
    if count == 5:
        break
[[  22 1228  434  354  227 2371    9]
 [ 267  160   89    0    0    0    0]
 [ 315 1008 8480 3798 2108  371 3233]
 [8232 8233  791 3798    0    0    0]]

[[1173 1061  586    9  896  729 1264  345 1062 1063]
 [3387  558  991 2166 3388 3231  558  238  120    0]
 [ 198 5997 3798    0    0    0    0    0    0    0]
 [ 223  310 3798    0    0    0    0    0    0    0]]

[[4015 4015 4015 4016  231 2117   57  422    9 4017 4018 4019   86   86]
 [2554   57  102  358   75    0    0    0    0    0    0    0    0    0]
 [  50   38  881 3798    0    0    0    0    0    0    0    0    0    0]
 [6729 6730 6731  382 3798    0    0    0    0    0    0    0    0    0]]

[[3479   75    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0]
 [4636 4637  233 4299  111  237 2626    9    0    0    0    0    0    0
     0    0    0]
 [  73  381  463 4321  142   96 7390 7391   92   85 1394 7392 5895 7393
    45 3798 7394]
 [8863 2844  991  127 5818    0    0    0    0    0    0    0    0    0
     0    0    0]]

[[ 226  615   22   75    0    0]
 [2135  703  237  435 3124    9]
 [2379 6264 3798    0    0    0]
 [6504 1912 2380 3798    0    0]]

[[5623  120    0    0    0    0    0    0    0    0]
 [ 133   54  102   63 1300   56    9   50   92 3181]
 [2094  383   73  464 3798    0    0    0    0    0]
 [ 223  101 8754  383 2085 5818 8755    0    0    0]]

(array([[ 374,   44, 2981,  435,  132,  111, 1040, 1382,    9,    0,    0,
       [ 369,  398,  283,    9, 2671, 1411,  136,  184,  769, 1262, 2061,
       [1094, 9024,  315,  381, 3798,    0,    0,    0,    0,    0,    0,
       [9036, 3798,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0]]), array([1, 1, 0, 0]), array([1, 1, 1, 1]))
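
The trailing zeros in the batches above come from padding every tensor to the length of the longest one in its batch. Here is a minimal standalone sketch of that step. It uses plain lists and a hypothetical pad_batch helper rather than the TensorGenerator itself, and assumes 0 is the padding id, as in the generator code.

```python
def pad_batch(batch, padding_id=0):
    """Right-pad each list of token ids to the longest length in the batch."""
    longest = max(len(tensor) for tensor in batch)
    return [tensor + [padding_id] * (longest - len(tensor)) for tensor in batch]


# token ids borrowed from the first batch shown earlier
batch = [
    [749, 1019, 313, 1020, 75],
    [1009, 9],
    [3540, 6030, 6031, 3798],
    [50, 96, 3798],
]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Padding per batch rather than to one global maximum keeps short batches small, which is why the batch widths above vary.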

Ladies and gentlemen, we have ourselves a generator.


Now that we have our data, the next step will be to define the model.

Sentiment Analysis: Deep Learning Model


Previously we created sentiment analysis models using the Logistic Regression and Naive Bayes algorithms. However, if we were to give those models an example like:

This movie was almost good.

The models would likely have predicted a positive sentiment for that review. That sentence, however, expresses the negative sentiment that the movie was not good. To avoid those kinds of misclassifications we will write a program that uses deep neural networks to identify sentiment in text.
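
To see why a word-counting model stumbles here, consider a toy scorer that just sums per-word sentiment weights. The weights below are made up for illustration (they are not taken from the earlier models), but they show how a modifier like "almost" gets ignored.

```python
# hypothetical per-word sentiment weights; positive values lean positive
WEIGHTS = {"good": 2.0, "bad": -2.0, "movie": 0.0, "almost": 0.0}


def bag_of_words_score(sentence):
    """Sum the weights of the words, ignoring order and context."""
    words = sentence.lower().strip(".").split()
    return sum(WEIGHTS.get(word, 0.0) for word in words)


score = bag_of_words_score("This movie was almost good.")
label = "positive" if score > 0 else "negative"
print(score, label)
```

Because each word contributes independently, "almost" cannot flip the contribution of "good", so the sentence is scored as positive.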

This model will follow a similar structure to the Continuous Bag of Words Model (Introducing the CBOW Model) that we looked at previously - indeed, most deep nets have a similar structure. The only things that change are the model architecture, the inputs, and the outputs. Although we looked at Trax and JAX in a previous post (Introducing Trax), we'll start off with a review of some of their features, and in future posts we'll implement the actual model.


# from python
import os
import random

# from pypi
from trax import layers
import trax
import trax.fastmath.numpy as numpy

Set Up

The Random Seed



Trax Review

JAX Arrays

First, the JAX reimplementation of numpy (from trax.fastmath).

an_array = numpy.array(5.0)
DeviceArray(5., dtype=float32)
<class 'jax.interpreters.xla._DeviceArray'>

Note: Trax is strict about types here, so the integer 5 won't work when we take the gradient later; the value has to be a float.


Now we'll create a function to square the array.

def square(x):
    return x**2
print(f"f({an_array}) -> {square(an_array)}")
f(5.0) -> 25.0


The gradient (derivative) of function f with respect to its input x is the derivative of \(x^2\).

  • The derivative of \(x^2\) is \(2x\).
  • When x is 5, then 2x=10.
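
For reference, the \(2x\) result follows directly from the limit definition of the derivative:

```latex
\[
\frac{d}{dx}x^2
  = \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h}
  = \lim_{h \to 0} \frac{2xh + h^2}{h}
  = \lim_{h \to 0} (2x + h)
  = 2x
\]
```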

You can calculate the gradient of a function by using trax.fastmath.grad(fun=) and passing in the name of the function.

  • In this case the function you want to take the gradient of is square.
  • The object returned (saved in square_gradient in this example) is a function that can calculate the gradient of square for a given trax.fastmath.numpy array.
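
The idea of a function that returns a gradient function can be sketched without Trax. The numerical_gradient helper below is hypothetical and only approximates the derivative with a central finite difference (Trax and JAX use automatic differentiation instead), but its call pattern mirrors trax.fastmath.grad.

```python
def numerical_gradient(fun, step=1e-3):
    """Return a function that approximates the derivative of `fun`."""
    def gradient(x):
        # central difference: (f(x + h) - f(x - h)) / 2h
        return (fun(x + step) - fun(x - step)) / (2 * step)
    return gradient


def square(x):
    return x**2


approximate_gradient = numerical_gradient(square)
print(type(approximate_gradient))
print(round(approximate_gradient(5.0), 6))
```

As with the Trax version, the returned object is an ordinary function that you call with the point where you want the gradient.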

Use trax.fastmath.grad to calculate the gradient (derivative) of the function.

square_gradient = trax.fastmath.grad(fun=square)

<class 'function'>

Call the newly created function and pass in a value for x (the DeviceArray stored in an_array).

gradient_calculation = square_gradient(an_array)
DeviceArray(10., dtype=float32)

The function returned by trax.fastmath.grad takes in x=5 and calculates the gradient of square, which is 2x, which equals 10. The value is also stored as a DeviceArray from the jax library.


Now that we've had a brief review of Trax let's move on to loading the data.