Named Entity Recognition (NER)

We'll start with the question: "What is Named Entity Recognition (NER)?" NER is a subtask of information extraction that locates and classifies named entities in a text. The named entities could be organizations, persons, locations, times, etc.

We'll train a named entity recognition system that can be trained in a few seconds (on a GPU) and gets around 75% accuracy. Then we'll load a version of the same model that was trained for a longer period of time and evaluate it to get 96% accuracy. Finally, we'll test the named entity recognition system with new sentences.

NER: Pre-Processing the Data

Preprocessing The Data

We will be using a dataset from Kaggle which appears to have originally come from the Groningen Meaning Bank (a bank of texts, not money). The original data consists of four columns: the sentence number, the word, the part of speech of the word, and the tag. A few tags you might expect to see are:

  • geo: geographical entity
  • org: organization
  • per: person
  • gpe: geopolitical entity
  • tim: time indicator
  • art: artifact
  • eve: event
  • nat: natural phenomenon
  • O: filler word


# python
from collections import namedtuple
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from expects import equal, expect
from sklearn.model_selection import train_test_split
from tabulate import tabulate

import pandas

Set Up

The Dataset

Note: to get the encoding for the file use file:

file -bi ner_dataset.csv

In this case we get:

application/csv; charset=iso-8859-1

Since it isn't ASCII or UTF-8 (the pandas default) we'll have to tell pandas what the encoding is.

load_dotenv("posts/nlp/.env", override=True)
path = Path(os.environ["NER_DATASET"]).expanduser()
data = pandas.read_csv(path, encoding="ISO-8859-1")


The Kaggle Data

print(tabulate(data.iloc[:5], tablefmt="orgtbl", headers="keys"))
  Sentence # Word POS Tag
0 Sentence: 1 Thousands NNS O
1 nan of IN O
2 nan demonstrators NNS O
3 nan have VBP O
4 nan marched VBN O

As you can (kind of) tell, the sentences are broken up so that each row has one word in it.

To make it easier to work with I'm going to rename the columns.

data = data.rename(columns={"Sentence #":"sentence", "Word": "word", "Tag": "tag"})

Words and Tags

The first thing we're going to do is separate out the words to build our vocabulary. The vocabulary will be a mapping of each word to an index so that we can convert our text to numbers for our model. In addition we're going to add a <PAD> token so that if our input is too short we can pad it to be the right size, and an UNK token in case we don't know a word.

token = namedtuple("Token", "pad unknown".split())
Token = token(pad="<PAD>", unknown="UNK")
vocabulary = {word: index for index, word in enumerate(data.word.unique())}
vocabulary[Token.pad] = len(vocabulary)
vocabulary[Token.unknown] = len(vocabulary)

We're going to do the same with the Tag column.

tags = {tag: index for index, tag in enumerate(data.tag.unique())}
{'O': 0, 'B-geo': 1, 'B-gpe': 2, 'B-per': 3, 'I-geo': 4, 'B-org': 5, 'I-org': 6, 'B-tim': 7, 'B-art': 8, 'I-art': 9, 'I-per': 10, 'I-gpe': 11, 'I-tim': 12, 'B-nat': 13, 'B-eve': 14, 'I-eve': 15, 'I-nat': 16}

Note: This is actually cheating because I am using the whole dataset. Later on make sure to only use the training data.

Sentences and Labels

We're also going to need to smash the words back into sentences. There's probably a clever pandas way to do this, but I'll just brute-force it. We'll also need to join the labels for the sentences into strings.

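sentences = []
labels = []
sentence = None
for row in data.itertuples():
    if not pandas.isna(row.sentence):
        # a filled-in sentence cell marks the start of a new sentence
        if sentence:
            sentences.append(sentence)
            labels.append(label)
        sentence = [row.word]
        label = [row.tag]
    else:
        sentence.append(row.word)
        label.append(row.tag)
# as written, the loop ends without appending the last sentence it builds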
['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through', 'London', 'to', 'protest', 'the', 'war', 'in', 'Iraq', 'and', 'demand', 'the', 'withdrawal', 'of', 'British', 'troops', 'from', 'that', 'country', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']

We're going to convert them to numbers so I didn't join them into strings.
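The "clever pandas way" mentioned above might look something like the following sketch. It's an assumption about how you could do it, not what the rest of this post uses: forward-fill the sentence column so every row carries its sentence marker, then group by that marker. The toy frame and the `toy_` names are made up for illustration.

```python
import pandas

toy = pandas.DataFrame({
    "sentence": ["Sentence: 1", None, None, "Sentence: 2", None],
    "word": ["Thousands", "of", "demonstrators", "London", "calling"],
    "tag": ["O", "O", "O", "B-geo", "O"],
})

# forward-fill so every row knows which sentence it belongs to
filled = toy.assign(sentence=toy["sentence"].ffill())

# sort=False keeps the sentences in their original order
# (sorting the marker strings would put "Sentence: 10" before "Sentence: 2")
grouped = filled.groupby("sentence", sort=False)
toy_sentences = grouped["word"].apply(list).tolist()
toy_labels = grouped["tag"].apply(list).tolist()

print(toy_sentences)
print(toy_labels)
```

Unlike a row-by-row loop, grouping picks up the final sentence without any special handling.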

To Numbers

sentence_vectors = [
    [vocabulary.get(word, vocabulary[Token.unknown]) for word in sentence]
    for sentence in sentences
]

assert len(sentence_vectors) == len(sentences)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 9, 15, 1, 16, 17, 18, 19, 20, 21]
label_vectors = [
    [tags[label] for label in sentence_labels] for sentence_labels in labels
]
assert len(label_vectors) == len(labels)
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]

In this case we're assuming that there are no unknown tags. The tags are only used for training and testing, so we wouldn't expect to see one that isn't in our current dataset. The sentences, on the other hand, are going to be used with new data and so might have tokens we haven't seen before.

We could add the padding here, but instead we're going to do it in the batch generator.
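As a small illustration of what the two special tokens are for, the sketch below encodes a sentence containing a word that isn't in a toy vocabulary and then pads it to a fixed length. The toy vocabulary, the helper names `encode` and `pad`, and the length of 6 are all made up for this example; the real padding happens in the batch generator.

```python
toy_vocabulary = {"the": 0, "war": 1, "in": 2, "<PAD>": 3, "UNK": 4}


def encode(sentence: list, vocabulary: dict, unknown: str="UNK") -> list:
    """Map words to indices, sending unseen words to the unknown index"""
    fallback = vocabulary[unknown]
    return [vocabulary.get(word, fallback) for word in sentence]


def pad(vector: list, length: int, pad_index: int) -> list:
    """Right-pad a vector with the padding index up to the given length"""
    return vector + [pad_index] * (length - len(vector))


encoded = encode(["the", "war", "in", "Iraq"], toy_vocabulary)
padded = pad(encoded, length=6, pad_index=toy_vocabulary["<PAD>"])
print(encoded)
print(padded)
```

Note that the fallback is the index of the unknown token, not the token string itself, so every entry in the vector is an integer.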

The Train-Test Split

This time we're going to do a real train-validation-test split.

splits = namedtuple("Split", "train validation test".split())
Split = splits(train=33570, validation=7194, test=7194)
x_train, x_leftovers, y_train, y_leftovers = train_test_split(sentences, labels, train_size=Split.train)
x_validation, x_test, y_validation, y_test = train_test_split(x_leftovers, y_leftovers, test_size=Split.test)

assert len(x_train) == Split.train
assert len(y_train) == Split.train
assert len(x_validation) == Split.validation
assert len(y_validation) == Split.validation
assert len(x_test) == Split.test
assert len(y_test) == Split.test

Bundling This Up


# python
from collections import namedtuple
from functools import partial
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from sklearn.model_selection import train_test_split

import attr
import pandas

Some Constants

Read = namedtuple("Read", "dotenv key encoding".split())
READ = Read(dotenv="posts/nlp/.env", key="NER_DATASET",
            encoding="ISO-8859-1")

COLUMNS={"Sentence #":"sentence",
         "Word": "word",
         "Tag": "tag"}

Token = namedtuple("Token", "pad unknown".split())
TOKEN = Token(pad="<PAD>", unknown="UNK")

Splits = namedtuple("Split", "train validation test".split())
SPLIT = Splits(train=33570, validation=7194, test=7194)

DataSets = namedtuple("DataSets", [
    "x_train",
    "y_train",
    "x_validate",
    "y_validate",
    "x_test",
    "y_test",
])

TheData = namedtuple("TheData", [

The Data Processor

Each of the three sets needs to be vectorized since I'm not saving the sentences beforehand. So this class handles that.

@attr.s(auto_attribs=True)
class DataFlattener:
    """Converts the kaggle data to sentences and labels

    Args:
     data: the data to convert
    """
    data: pandas.DataFrame
    _sentences: list=None
    _labels: list=None
  • Sentences
    @property
    def sentences(self) -> list:
        """List of sentences from the data"""
        if self._sentences is None:
            self.set_sentences_and_labels()
        return self._sentences
  • Labels
    @property
    def labels(self) -> list:
        """List of labels from the data"""
        if self._labels is None:
            self.set_sentences_and_labels()
        return self._labels
  • Sentences and Labels maker
    def set_sentences_and_labels(self) -> None:
        """Converts the data to lists
        of sentence token lists and also sets the labels
        """
        self._sentences = []
        self._labels = []
        sentence = None
        for row in self.data.itertuples():
            if not pandas.isna(row.sentence):
                if sentence:
                    self._sentences.append(sentence)
                    self._labels.append(labels)
                sentence = [row.word]
                labels = [row.tag]
            else:
                sentence.append(row.word)
                labels.append(row.tag)

Data Vectorizer

@attr.s(auto_attribs=True)
class DataVectorizer:
    """Converts the data-set strings to vectors

    Args:
     data_sets: the split up data sets
     vocabulary: map from token to index
     tags: map from tag to index
    """
    data_sets: namedtuple
    vocabulary: dict
    tags: dict
    _vectorized_datasets: namedtuple=None
  • Vectorized Data Sets
    @property
    def vectorized_datasets(self) -> namedtuple:
        """the original data sets converted to indices"""
        if self._vectorized_datasets is None:
            sentence_vectors = partial(self.to_vectors,
                                       to_index=self.vocabulary)
            label_vectors = partial(self.to_vectors,
                                    to_index=self.tags)
            self._vectorized_datasets = DataSets(
                x_train = sentence_vectors(self.data_sets.x_train),
                y_train = label_vectors(self.data_sets.y_train),
                x_validate = sentence_vectors(self.data_sets.x_validate),
                y_validate = label_vectors(self.data_sets.y_validate),
                x_test = sentence_vectors(self.data_sets.x_test),
                y_test = label_vectors(self.data_sets.y_test),
            )
        return self._vectorized_datasets
  • Sentence Vectors
    def to_vectors(self, source: list, to_index: dict) -> list:
        """Sentences converted to Integers

        Args:
         source: iterator of tokenized strings to convert
         to_index: map to convert the tokens to indices

        Returns:
         tokens in source converted to indices
        """
        # fall back to the index of the unknown token, not the token string
        unknown = to_index[TOKEN.unknown]
        vectors = [
                [to_index.get(token, unknown)
                 for token in line]
                for line in source
        ]
        assert len(vectors) == len(source)
        return vectors

The Splitter

@attr.s(auto_attribs=True)
class DataSplitter:
    """Splits up the training, testing, etc.

    Args:
     split: constants with the train, test counts
     sentences: input data to split
     labels: y-data to split
     random_state: seed for the splitting
    """
    split: namedtuple
    sentences: list
    labels: list
    random_state: int=None
    _data_sets: namedtuple=None
  • Data Sets
    @property
    def data_sets(self) -> namedtuple:
        """The Split data sets"""
        if self._data_sets is None:
            x_train, x_leftovers, y_train, y_leftovers = train_test_split(
                self.sentences, self.labels,
                train_size=self.split.train,
                random_state=self.random_state)
            x_validate, x_test, y_validate, y_test = train_test_split(
                x_leftovers, y_leftovers,
                test_size=self.split.test,
                random_state=self.random_state)
            self._data_sets = DataSets(x_train=x_train,
                                       y_train=y_train,
                                       x_validate=x_validate,
                                       y_validate=y_validate,
                                       x_test=x_test,
                                       y_test=y_test)
            assert len(x_train) + len(x_validate) + len(x_test) == len(self.sentences)
        return self._data_sets

The Loader

@attr.s(auto_attribs=True)
class DataLoader:
    """Loads and converts the kaggle data

    Args:
      read: the stuff to download the data
    """
    read: namedtuple=READ
    _data: pandas.DataFrame=None
    _vocabulary: dict=None
    _tags: dict=None
  • The Kaggle Data
    @property
    def data(self) -> pandas.DataFrame:
        """The original kaggle dataset"""
        if self._data is None:
            load_dotenv(self.read.dotenv, override=True)
            path = Path(os.environ[self.read.key]).expanduser()
            self._data = pandas.read_csv(path,
                                         encoding=self.read.encoding)
            self._data = self._data.rename(columns=COLUMNS)
        return self._data
  • The Vocabulary
    @property
    def vocabulary(self) -> dict:
        """map of word to index

        Note:
          This is creating a transformation of the entire data-set
        so it comes before the train-test-split so it uses the whole
        dataset, not just training
        """
        if self._vocabulary is None:
            self._vocabulary = {
                word: index
                for index, word in enumerate(self.data.word.unique())}
            self._vocabulary[TOKEN.pad] = len(self._vocabulary)
            self._vocabulary[TOKEN.unknown] = len(self._vocabulary)
        return self._vocabulary
  • The Tags
    @property
    def tags(self) -> dict:
        """map of tag to index"""
        if self._tags is None:
            self._tags = {tag: index for index, tag in enumerate(
                self.data.tag.unique())}
            self._tags[TOKEN.unknown] = len(self._tags)
        return self._tags

The Processor

class NERData:
    """Master NER Data preparer

     read_constants: stuff to help load the dataset
     split_constants: stuff to help split the dataset
     random_state: seed for the splitting
    read_constants: namedtuple=READ
    split_constants: namedtuple=SPLIT
    random_state: int=33
    _data: namedtuple=None
    _loader: DataLoader=None
    _flattener: DataFlattener=None
    _splitter: DataSplitter=None
    _vectorizer: DataVectorizer=None
  • The Data
    def data(self) -> namedtuple:
        """The split up data sets"""
        if self._data is None:
            self._data = TheData(
        return self._data
  • The Loader
    def loader(self) -> DataLoader:
        """The loader of the data"""
        if self._loader is None:
            self._loader = DataLoader(
        return self._loader
  • The Flattener
    def flattener(self) -> DataFlattener:
        """The sentence and label builder"""
        if self._flattener is None:
            self._flattener = DataFlattener(
        return self._flattener
  • The Splitter
    def splitter(self) -> DataSplitter:
        """The splitter upper for the data"""
        if self._splitter is None:
            self._splitter = DataSplitter(
                sentences = self.flattener.sentences,
                labels = self.flattener.labels,
        return self._splitter
  • The Vectorizer
    def vectorizer(self) -> DataVectorizer:
        """Vectorizes the raw-data sets"""
        if self._vectorizer is None:
            self._vectorizer = DataVectorizer(
        return self._vectorizer

Testing It Out

from neurotic.nlp.named_entity_recognition import NERData

ner = NERData()


RNNS and Vanishing Gradients

Vanishing Gradients

This will be a look at the problem of vanishing gradients from an intuitive standpoint.


Adding layers to a neural network introduces multiplicative effects in both forward and backward propagation. The back-prop in particular presents a problem as the gradient of activation functions can be very small. Multiplied together across many layers, their product can be vanishingly small. This results in weights not being updated in the front layers and training not progressing.

Gradients of the sigmoid function, for example, are in the range 0 to 0.25. To calculate gradients for the front layers of a neural network the chain rule is used. This means that these tiny values are multiplied together starting at the last layer and working backwards to the first layer, so the gradient shrinks exponentially with the number of layers it passes through.


# python
from collections import namedtuple
from functools import partial

# pypi
import holoviews
import hvplot.pandas
import numpy
import pandas

# another project
from graeae import EmbedHoloviews

Set Up

SLUG = "rnns-and-vanishing-gradients"
Embed = partial(EmbedHoloviews,
Plot = namedtuple("Plot", ["width", "height", "fontscale", "tan", "blue", "red"])
PLOT = Plot(


The Data

This will be an evenly spaced set of points over an interval (see numpy.linspace).

STOP, STEPS = 10, 100
x = numpy.linspace(-STOP, STOP, STEPS)

The Sigmoid

Our activation function will be the sigmoid (well, the logistic function).

def sigmoid(x: numpy.ndarray) -> numpy.ndarray:
    return 1 / (1 + numpy.exp(-x))

Now we'll calculate the activations for our input data.

activations = sigmoid(x)

The Gradient

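Our gradient is the derivative of the sigmoid, \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\). Since the derivative can be written in terms of the sigmoid's output, the function takes the activations rather than the raw inputs.

def gradient(x: numpy.ndarray) -> numpy.ndarray:
    # x is the sigmoid's output (the activation), not its input
    return x * (1 - x)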

Now we can get the gradients for our activations.

gradients = gradient(activations)
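One way to convince yourself that gradient really is the derivative when it's fed the activations is to compare it with a finite-difference estimate. This is only a sanity-check sketch: the step size and the `check_x` points are arbitrary choices, and the two functions are repeated so the block runs on its own.

```python
import numpy


def sigmoid(x):
    return 1 / (1 + numpy.exp(-x))


def gradient(activation):
    # expects the sigmoid's output, not its input
    return activation * (1 - activation)


check_x = numpy.linspace(-10, 10, 101)
step = 1e-5

# central difference approximation of the derivative of the sigmoid
numerical = (sigmoid(check_x + step) - sigmoid(check_x - step)) / (2 * step)
analytic = gradient(sigmoid(check_x))

print(numpy.allclose(numerical, analytic, atol=1e-8))
print(analytic.max())
```

The largest value shows up at the center of the interval, which is where the 0.25 upper bound on the sigmoid's gradient comes from.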

Plotting the Sigmoid

tangent_x = 0
tangent_y = sigmoid(tangent_x)
span = 2

gradient_tangent = gradient(sigmoid(tangent_x))

tangent_plot_x = numpy.linspace(tangent_x - span, tangent_x + span, STEPS)
tangent_plot_y = tangent_y + gradient_tangent * (tangent_plot_x - tangent_x)

frame = pandas.DataFrame.from_dict(
    {"X": x,
     "Sigmoid": activations,
     "X-Tangent": tangent_plot_x,
     "Y-Tangent": tangent_plot_y,
     "Gradient": gradients})
plot = (frame.hvplot(x="X", y="Sigmoid").opts(
        * frame.hvplot(x="X", y="Gradient").opts(
        * frame.hvplot(x="X-Tangent",
            title="Sigmoid and Tangent",
output = Embed(plot=plot, file_name="sigmoid_tangent")()

The thing to notice is that as the input data moves away from the center (at 0) the gradients get smaller in either direction, rapidly approaching zero.

The Numerical Impact

Multiplication & Decay

Multiplying numbers smaller than 1 results in smaller and smaller numbers. Below is an example that finds the gradient for an input x = 0 and multiplies it by itself over n steps. Look how quickly it vanishes to almost zero, even though \(\sigma(0) = 0.5\) gives a sigmoid gradient of 0.25, which happens to be the largest sigmoid gradient possible.

A Decay Simulation

Input data

n = 6
x = 0

gradients = gradient(sigmoid(x))
steps = numpy.arange(1, n + 1)
print("-- Inputs --")
print("steps :", n)
print("x value :", x)
print("sigmoid :", "{:.5f}".format(sigmoid(x)))
print("gradient :", "{:.5f}".format(gradients), "\n")
-- Inputs --
steps : 6
x value : 0
sigmoid : 0.50000
gradient : 0.25000 

Plot The Decay

decaying_values = (numpy.ones(len(steps)) * gradients).cumprod()
data = pandas.DataFrame.from_dict(dict(Step=steps, Gradient=decaying_values))
plot = data.hvplot(x="Step", y="Gradient").opts(
    title="Cumulative Gradient",
output = Embed(plot=plot, file_name="cumulative_gradient")()

The point being that the gradients very quickly approach zero.
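To put a number on "very quickly", you can ask how many multiplications by the best-case gradient of 0.25 it takes to drop below some threshold. The threshold here is an arbitrary value picked for illustration.

```python
import math

best_case_gradient = 0.25
threshold = 1e-6

# smallest n such that best_case_gradient ** n is below the threshold
steps_needed = math.ceil(math.log(threshold) / math.log(best_case_gradient))

print(steps_needed)
print(best_case_gradient ** steps_needed)
```

Any input other than 0 has a smaller gradient, so it would cross the threshold in fewer steps.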

So, How Do You Fix This?

One solution is to use activation functions that don't have tiny gradients. Other solutions involve more sophisticated model design. But they're both discussions for another time.

Deep N-Grams: Batch Generation

Generating Batches of Data

Most of the time in Natural Language Processing, and AI in general, we use batches when training our models. Here, you will build a data generator that takes in a text and returns a batch of text lines (lines are sentences).

  • The generator converts text lines (sentences) into numpy arrays of integers padded by zeros so that all arrays have the same length, which is the length of the longest sentence in the entire data set.

This generator returns the data in a format that you can use directly in your model when computing the feed-forward pass of your algorithm. Each batch is a tuple of three parts: inputs, targets, and mask. The inputs and targets are identical; the second copy will be used to evaluate your predictions. The mask is 1 for non-padding tokens and 0 for padding.
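As a concrete picture of the (inputs, targets, mask) tuple, here's a sketch using plain numpy rather than trax. The token values and the `make_batch` helper are invented for the example.

```python
import numpy


def make_batch(tensors: list, max_length: int) -> tuple:
    """Pad each tensor with zeros and build the matching mask"""
    padded = [tensor + [0] * (max_length - len(tensor)) for tensor in tensors]
    batch = numpy.array(padded)
    # 1 wherever there is a real token, 0 wherever there is padding
    mask = (batch != 0).astype(int)
    return batch, batch, mask


inputs, targets, mask = make_batch([[5, 6, 7, 1], [8, 9, 1]], max_length=6)
print(inputs)
print(mask)
```

This only works because 0 is reserved for padding; a real token with id 0 would be masked out by mistake.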


# python
from itertools import cycle
import random

# from pypi
from expects import be_true, expect
import trax.fastmath.numpy as numpy

# this project
from neurotic.nlp.deep_rnn.data_loader import DataLoader

Set Up

The DataLoader

data_loader = DataLoader()


The Data Generator

  • While True loop: this will yield one batch at a time.
  • if index >= num_lines, set index to 0.
  • The generator should return shuffled batches of data. To achieve this without modifying the actual lines, a list containing the indexes of data_lines is created. This list can be shuffled and used to get random batches every time the index is reset.
  • if len(line) < max_length append line to cur_batch.
    • Note that a line that has length equal to max_length should not be appended to the batch.
    • This is because when converting the characters into a tensor of integers, an additional end of sentence token id will be added.
    • So if max_length is 5, and a line has 4 characters, the tensor representing those 4 characters plus the end of sentence character will be of length 5, which is the max length.
  • if len(cur_batch) == batch_size, go over every line, convert it to an int and store it.
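The length rule in the list above can be checked with a stand-in for line_to_tensor. The stand-in below (character codes followed by an end-of-sentence id of 1) is an assumption based on the expected output shown later in this post, not the DataLoader's actual code.

```python
EOS = 1  # assumed end-of-sentence id


def toy_line_to_tensor(line: str) -> list:
    """Hypothetical stand-in: character codes plus the EOS id"""
    return [ord(character) for character in line] + [EOS]


max_length = 5
lines = ["abcd", "abcde", "ab"]

# only lines strictly shorter than max_length still fit once the EOS is added
kept = [line for line in lines if len(line) < max_length]
lengths = [len(toy_line_to_tensor(line)) for line in kept]

print(kept)
print(lengths)
```

The four-character line ends up exactly at max_length, while the five-character line is dropped because its tensor would have six entries.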

Remember that when calling numpy here you are really calling trax.fastmath.numpy, which is trax’s version of numpy that is compatible with JAX. As a result, where you used to encounter the type numpy.ndarray you will now find the type jax.interpreters.xla.DeviceArray.


  • Use the line_to_tensor function above inside a list comprehension in order to pad lines with zeros.
  • The length of the tensor is always 1 + the length of the original line of characters, so account for this when setting the padding of zeros.

To get it to pass you'll have to pass in the to-tensor method of the DataLoader so we'll need to alias it to match their definition.

line_to_tensor = data_loader.to_tensor

Implementing the Generator

def data_generator(batch_size: int, max_length: int, data_lines: list,
                   line_to_tensor=line_to_tensor, shuffle: bool=True):
    """Generator function that yields batches of data

    Args:
       batch_size (int): number of examples (in this case, sentences) per batch.
       max_length (int): maximum length of the output tensor.
       NOTE: max_length includes the end-of-sentence character that will be added
               to the tensor.
               Keep in mind that the length of the tensor is always 1 + the length
               of the original line of characters.
       data_lines (list): list of the sentences to group into batches.
       line_to_tensor (function, optional): function that converts line to tensor. Defaults to line_to_tensor.
       shuffle (bool, optional): True if the generator should generate random batches of data. Defaults to True.

    Yields:
       tuple: two copies of the batch (jax.interpreters.xla.DeviceArray) and mask (jax.interpreters.xla.DeviceArray).
       NOTE: jax.interpreters.xla.DeviceArray is trax's version of numpy.ndarray
    """
    # initialize the index that points to the current position in the lines index array
    index = 0

    # initialize the list that will contain the current batch
    cur_batch = []

    # count the number of lines in data_lines
    num_lines = len(data_lines)

    # create an array with the indexes of data_lines that can be shuffled
    lines_index = [*range(num_lines)]

    # shuffle line indexes if shuffle is set to True
    if shuffle:
        random.shuffle(lines_index)

    while True:

        # if the index is greater or equal than to the number of lines in data_lines
        if index >= num_lines:
            # then reset the index to 0
            index = 0
            # shuffle line indexes if shuffle is set to True
            if shuffle:
                random.shuffle(lines_index)

        # get a line at the `lines_index[index]` position in data_lines
        line = data_lines[lines_index[index]]

        # if the length of the line is less than max_length
        if len(line) < max_length:
            # append the line to the current batch
            cur_batch.append(line)

        # increment the index by one
        index += 1

        # if the current batch is now equal to the desired batch size
        if len(cur_batch) == batch_size:

            batch = []
            mask = []

            # go through each line (li) in cur_batch
            for li in cur_batch:
                # convert the line (li) to a tensor of integers
                tensor = line_to_tensor(li)

                # Create a list of zeros to represent the padding
                # so that the tensor plus padding will have length `max_length`
                pad = [0] * (max_length - len(tensor))

                # combine the tensor plus pad
                tensor_pad = tensor + pad

                # append the padded tensor to the batch
                batch.append(tensor_pad)

                # A mask for  tensor_pad is 1 wherever tensor_pad is not
                # 0 and 0 wherever tensor_pad is 0, i.e. if tensor_pad is
                # [1, 2, 3, 0, 0, 0] then example_mask should be
                # [1, 1, 1, 0, 0, 0]
                # Hint: Use a list comprehension for this
                example_mask = [int(item != 0) for item in tensor_pad]
                mask.append(example_mask)

            # convert the batch (data type list) to a trax's numpy array
            batch_np_arr = numpy.array(batch)
            mask_np_arr = numpy.array(mask)

            # Yield two copies of the batch and mask.
            yield batch_np_arr, batch_np_arr, mask_np_arr

            # reset the current batch to an empty list
            cur_batch = []

Try out the data generator.

tmp_lines = ['12345678901',

Create a generator with a batch size of 2 and a maximum length of 10.

tmp_data_gen = data_generator(batch_size=2, max_length=10,
                              data_lines=tmp_lines, shuffle=False)

Get one batch.

tmp_batch = next(tmp_data_gen)

View the batch.


expected = (numpy.array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
                         [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]]),
            numpy.array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
                         [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]]),
            numpy.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))
for index, batch in enumerate(tmp_batch):
(DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
             [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32), DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
             [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32), DeviceArray([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
             [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32))

Now that you have your generator, you can call it and it will return tensors which correspond to your lines in Shakespeare. The first column and the second column are identical. Now you can go ahead and start building your neural network.

Repeating Batch generator

The way the iterator is currently defined, it will keep providing batches forever.

Although it is not needed, we want to show you the itertools.cycle function which is really useful when you have a generator that eventually stops.

Usually we want to cycle over the dataset multiple times during training (i.e. train for multiple epochs).

For small datasets we can use itertools.cycle to achieve this easily.

infinite_data_generator = cycle(
    data_generator(batch_size=2, max_length=10, data_lines=tmp_lines))
ten_lines = [next(infinite_data_generator) for _ in range(10)]
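Where cycle earns its keep is with a generator that does run out. Here is a minimal sketch with a made-up finite generator. One thing to be aware of is that cycle keeps a copy of every item it has seen so that it can replay them, which is why it's suggested for small datasets.

```python
from itertools import cycle, islice


def finite_batches():
    """A made-up generator that stops after two batches"""
    yield "batch-0"
    yield "batch-1"


# without cycle this would raise StopIteration on the third next() call
first_five = list(islice(cycle(finite_batches()), 5))
print(first_five)
```

The replayed items come back in the same order every pass, so cycle on its own doesn't reshuffle between epochs.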

Bundle It Up

As always, since this is going to be needed further down the road, I'll bundle it up.


# python
import random

# pypi
import attr
import trax.fastmath.numpy as numpy

# this project
from neurotic.nlp.deep_rnn.data_loader import DataLoader

Data Generator

@attr.s(auto_attribs=True)
class DataGenerator:
    """Generates batches

    Args:
     data: lines of data
     data_loader: something with to-tensor method
     batch_size: size of the batches
     max_length: the maximum length for a line (longer lines will be ignored)
     shuffle: whether to shuffle the data
    """
    data: list
    data_loader: DataLoader
    batch_size: int
    max_length: int
    shuffle: bool=True
    _line_count: int=None
    _line_indices: list=None
    _generator: object=None

Line Count

@property
def line_count(self) -> int:
    """Number of lines in the data"""
    if self._line_count is None:
        self._line_count = len(self.data)
    return self._line_count

Line Indices

@property
def line_indices(self) -> list:
    """Indices of the lines in the data"""
    if self._line_indices is None:
        self._line_indices = list(range(self.line_count))
    return self._line_indices

The Iterator Method

def __iter__(self):
    """A pass-through for this method"""
    return self

The Batch Generator

def data_generator(self):
    """Generator method that yields batches of data

    Yields:
     (batch, batch, mask)
    """
    index = 0
    current_batch = []
    if self.shuffle:
        random.shuffle(self.line_indices)

    while True:
        if index >= self.line_count:
            index = 0
            if self.shuffle:
                random.shuffle(self.line_indices)

        line = self.data[self.line_indices[index]]
        if len(line) < self.max_length:
            current_batch.append(line)
        index += 1

        if len(current_batch) == self.batch_size:
            batch = []
            mask = []
            for line in current_batch:
                tensor = self.data_loader.to_tensor(line)
                tensor += [0] * (self.max_length - len(tensor))
                batch.append(tensor)
                mask.append([int(item != 0) for item in tensor])

            batch = numpy.array(batch)
            yield batch, batch, numpy.array(mask)
            current_batch = []

The Generator

@property
def generator(self):
    """Infinite generator of batches"""
    if self._generator is None:
        self._generator = self.data_generator()
    return self._generator

The Next Method

def __next__(self):
    """make this an iterator"""
    return next(self.generator)

Try It Out

from neurotic.nlp.deep_rnn import DataGenerator, DataLoader

loader = DataLoader()
test_lines = ['12345678901',

generator = DataGenerator(data=test_lines,

actual = next(generator)

expected = (numpy.array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
                         [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]]),
            numpy.array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
                         [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]]),
            numpy.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))
for index, batch in enumerate(actual):
    except AssertionError:

Deep N-Grams: Generating Sentences

Generating New Sentences

Now we'll use the language model to generate new sentences. For that we need to make draws from a Gumbel distribution.

The Gumbel Probability Density Function (PDF) is defined as: \[ f(z) = {1\over{\beta}}e^{-\left(z+e^{-z}\right)} \]

Where: \[ z = {(x - \mu)\over{\beta}} \]

The maximum value is what we choose as the prediction in the last step of the Recurrent Neural Network (RNN) we are using for text generation. The maximum of a sample of random variables drawn from an exponential distribution approaches the Gumbel distribution as the sample size grows. For that reason, the Gumbel distribution is used to sample from a categorical distribution.


# python
from pathlib import Path

# from pypi
import numpy

# this project
from neurotic.nlp.deep_rnn import GRUModel

Set Up

gru = GRUModel()
model = gru.model
ours = Path("~/models/gru-shakespeare-model/model.pkl.gz").expanduser()


The Gumbel Sample

def gumbel_sample(log_probabilities: numpy.array,
                  temperature: float=1.0) -> int:
    """Gumbel sampling from a categorical distribution

    Args:
     log_probabilities: model predictions for a given input
     temperature: fudge

    Returns:
     the maximum sample
    """
    u = numpy.random.uniform(low=1e-6, high=1.0 - 1e-6,
                             size=log_probabilities.shape)
    g = -numpy.log(-numpy.log(u))
    return numpy.argmax(log_probabilities + g * temperature, axis=-1)
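To see that adding Gumbel noise to log-probabilities and taking the argmax really does sample from the categorical distribution, here is a small experiment. It's a sketch with made-up probabilities, a fixed seed, and numpy's `default_rng` instead of the global random state, so none of these names come from the model code.

```python
import numpy

rng = numpy.random.default_rng(0)
probabilities = numpy.array([0.7, 0.2, 0.1])
log_probabilities = numpy.log(probabilities)

draws = 20_000
u = rng.uniform(low=1e-6, high=1.0 - 1e-6, size=(draws, len(probabilities)))
g = -numpy.log(-numpy.log(u))

# temperature of 1: each row is one sample from the categorical distribution
samples = numpy.argmax(log_probabilities + g, axis=-1)
frequencies = numpy.bincount(samples, minlength=len(probabilities)) / draws

# temperature of 0: the noise is switched off so you always get the argmax
greedy = numpy.argmax(log_probabilities + g * 0.0, axis=-1)

print(frequencies)
print(numpy.unique(greedy))
```

The sampled frequencies should land close to the original probabilities, while a temperature of zero collapses to always picking the most likely category.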

A Predictor


def predict(number_of_characters: int, prefix: str,
            break_on: int=END_OF_SENTENCE) -> str:
    """Predicts characters

    Args:
     number_of_characters: how many characters to predict
     prefix: character to prompt the predictions
     break_on: identifier for character to prematurely stop on

    Returns:
     prefix followed by predicted characters
    """
    inputs = [ord(character) for character in prefix]
    result = list(prefix)
    maximum_length = len(prefix) + number_of_characters
    for _ in range(number_of_characters):
        current_inputs = numpy.array(inputs + [0] * (maximum_length - len(inputs)))
        output = model(current_inputs[None, :])  # Add batch dim.
        next_character = gumbel_sample(output[0, len(inputs)])
        inputs += [int(next_character)]

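        if inputs[-1] == break_on:
            break  # EOS
        # keep the predicted character so it ends up in the returned string
        result.append(chr(int(next_character)))

    return "".join(result)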

Some Predictions

print(predict(32, ""))
you would not live at essenomed 

Yes, but I don't know anyone who would. Note that we are using a random sample, so repeatedly making predictions won't necessarily get you the same result.

print(predict(32, ""))
print(predict(32, ""))
print(predict(32, ""))
katharine       yes, you are like the 
le beau where's some of my prett
print(predict(64, "falstaff"))
falstaff        yea, marry, lady, she hath bianced three months.


print(predict(64, "beast"))
beastly, and god forbid, sir! our revenue's cannon,
start = "finger"
for word in range(5):
    start = predict(10, start)
finger, iago, an
finger, iago, and ask.
finger, iago, and ask.
finger, iago, and ask.
finger, iago, and ask.

So, if you feed it enough text, it becomes more deterministic.

SPACE = ord(" ")
start = "iago"
output = start
for word in range(10):
    tokens = predict(32, start).split()
    start = tokens[1] if len(tokens) > 1 else tokens[0]
    output = f"{output} {start}"
iago your husband if there never for you need no never

In the generated text above, you can see that the model generates text that makes sense, capturing dependencies between words, without any input. A simple n-gram model would not have been able to capture all of that in one sentence.

On statistical methods

Using a statistical method will not give you results that are as good. The model would not be able to encode information seen previously in the data set and as a result, the perplexity will increase. The higher the perplexity, the worse your model is. Furthermore, statistical N-Gram models take up too much space and memory, which makes them inefficient and slow. Conversely, with deep neural networks, you can get a better perplexity. Note though, that learning about n-gram language models is still important and leads to a better understanding of deep neural networks.

Deep N-Grams: Evaluating the Model

Evaluating the Model

Now that you have learned how to train a model, you will learn how to evaluate it. To evaluate language models, we usually use perplexity which is a measure of how well a probability model predicts a sample. Note that perplexity is defined as:

\[ P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,\ldots,w_{i-1})}} \]

As an implementation hack, you would usually take the log of that formula. This lets us use the log probabilities that we get as the output of our RNN directly, and it turns exponents into products and products into sums, which makes the computation simpler and more efficient. You should also take care of the padding: you do not want to include it when calculating the perplexity, because that would make the perplexity look artificially good.

\begin{align}
\log P(W) &= \log\left(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,\ldots,w_{i-1})}}\right) \\
&= \log\left(\left(\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,\ldots,w_{i-1})}\right)^{\frac{1}{N}}\right) \\
&= \log\left(\left(\prod_{i=1}^{N} P(w_i| w_1,\ldots,w_{i-1})\right)^{-\frac{1}{N}}\right) \\
&= -\frac{1}{N} \log\left(\prod_{i=1}^{N} P(w_i| w_1,\ldots,w_{i-1})\right) \\
&= -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i| w_1,\ldots,w_{i-1})
\end{align}

Instructions: Write a program that will help evaluate your model. Implementation hack: your program takes in preds and target. Preds is a tensor of log probabilities. You can use tl.one_hot to transform the target into the same dimension. You then multiply them and sum.

You also have to create a mask to only get the non-padded probabilities. Good luck!


  • To convert the target into the same dimension as the predictions tensor use tl.one_hot with target and preds.shape[-1].
  • You will also need the np.equal function in order to unpad the data and properly compute perplexity.
  • Keep in mind while implementing the formula above that \(w_i\) represents a letter from our 256 letter alphabet.
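Before the Trax version, the same bookkeeping can be sketched in plain Python. This is only an illustration under stated assumptions: masked_log_perplexity is a made-up name, the predictions are a list of per-position log-probability lists rather than a tensor, the padding ID is assumed to be 0, and at least one target is assumed not to be padding.

```python
import math


def masked_log_perplexity(log_probs: list, targets: list, pad_id: int = 0) -> float:
    """Average negative log probability of the non-padding targets.

    log_probs: one list of log probabilities (one per class) per position
    targets: the correct class ID for each position
    pad_id: ID used for padding, skipped in the average
    """
    total = 0.0
    count = 0
    for row, target in zip(log_probs, targets):
        if target == pad_id:
            continue  # padding must not count toward the average
        total += row[target]  # what multiplying by a one-hot vector selects
        count += 1
    return -total / count


# a model that spreads its probability evenly over four classes
uniform = [math.log(0.25)] * 4
log_probs = [uniform, uniform, uniform]
targets = [2, 3, 0]  # the last position is padding

log_ppx = masked_log_perplexity(log_probs, targets)
perplexity = math.exp(log_ppx)
print(log_ppx, perplexity)
```

A model that guesses uniformly among four classes ends up with a perplexity of four, which is a handy way to sanity-check an implementation.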


# python
from collections import namedtuple
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from trax import layers

import trax.fastmath.numpy as numpy
import jax
# this project
from neurotic.nlp.deep_rnn import DataLoader, DataGenerator, GRUModel

Set Up

DataSettings = namedtuple(
    "DataSettings",
    "batch_size max_length output".split())
SETTINGS = DataSettings(batch_size=32,
loader = DataLoader()
training_generator = DataGenerator(, data_loader=loader,


def test_model(preds: jax.interpreters.xla.DeviceArray,
               target: jax.interpreters.xla.DeviceArray) -> float:
    """Function to test the model.

    Args:
       preds: Predictions of a list of batches of tensors corresponding to lines of text.
       target: Actual list of batches of tensors corresponding to lines of text.

    Returns:
       float: log_perplexity of the model.
    """
    # pick out the log probability assigned to each target character
    total_log_ppx = numpy.sum(
        layers.one_hot(x=target, n_categories=preds.shape[-1]) * preds,
        axis=-1)

    # 1.0 where the target is a real character, 0.0 where it is padding
    non_pad = 1.0 - numpy.equal(target, 0)
    ppx = total_log_ppx * non_pad  # get rid of the padding

    # average over the non-padded positions only
    log_ppx = numpy.sum(ppx) / numpy.sum(non_pad)

    return -log_ppx


Pre-Built Model

We're going to start with a pre-built file and see how it does relative to our model.

gru = GRUModel()
model = gru.model
pre_built = Path(os.environ["PRE_BUILT_MODEL"]).expanduser()
batch = next(training_generator)
preds = model(batch[0])
log_ppx = test_model(preds, batch[1])
print('The log perplexity and perplexity of your model are respectively', log_ppx, numpy.exp(log_ppx))
The log perplexity and perplexity of your model are respectively 2.0370717 7.6681223

Our Model

gru = GRUModel()
model = gru.model
ours = Path("~/models/gru-shakespeare-model/model.pkl.gz").expanduser()
batch = next(training_generator)
preds = model(batch[0])
log_ppx = test_model(preds, batch[1])
print('The log perplexity and perplexity of your model are respectively', log_ppx, numpy.exp(log_ppx))
The log perplexity and perplexity of your model are respectively 0.93021315 2.5350494

On the one hand I over-trained my model, on the other hand… why such a big difference?
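As a sanity check on the two outputs above, perplexity is just the exponential of the log perplexity, so the printed numbers can be reproduced with the standard library. One way to read them (an interpretation, not something measured in this post): a perplexity of k is the uncertainty of choosing uniformly among k characters.

```python
import math

# the log perplexities printed above
pre_built_log_ppx = 2.0370717
our_log_ppx = 0.93021315

pre_built_ppx = math.exp(pre_built_log_ppx)
our_ppx = math.exp(our_log_ppx)

print(f"pre-built: {pre_built_ppx:.4f}")
print(f"ours: {our_ppx:.4f}")
print(f"ratio: {pre_built_ppx / our_ppx:.2f}")
```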

Deep N-Grams: Training the Model

Training The Model

Now we are going to train the model. We have to define:

  • the cost function
  • the optimizer

To train a model on a task, Trax defines an abstraction called TrainTask which packages the training data, loss, and optimizer (among other things) together into an object.

Similarly, to evaluate a model Trax defines an abstraction called EvalTask which packages the eval data and metrics (among other things) into another object (and which doesn't seem to have any documentation yet).

The final piece tying things together is the Loop abstraction, a very simple and flexible way to put everything together and train the model, all the while evaluating it and saving checkpoints.

Using training.Loop will save you a lot of code compared to always writing the training loop by hand, like you did in courses 1 and 2. More importantly, you are less likely to have a bug in that code that would ruin your training.


# python
from collections import namedtuple
from datetime import datetime
from functools import partial

# pypi
from expects import equal, expect
from holoviews import opts
from trax.supervised import training as trax_training
from trax import layers

import holoviews
import hvplot.pandas
import pandas
import trax

# this project
from neurotic.nlp.deep_rnn import GRUModel, DataGenerator, DataLoader

# another project
from graeae import EmbedHoloviews, Timer

Set Up

Some Constants

DataSettings = namedtuple(
    "DataSettings",
    "batch_size max_length learning_rate output".split())
SETTINGS = DataSettings(batch_size=32,

Previous Code From this Series

loader = DataLoader()

# the name "training" was getting confusing (since trax's module is also called
# training) so this is training_generator and theirs is trax_training
training_generator = DataGenerator(, data_loader=loader,

evaluation = DataGenerator(data=loader.validation, data_loader=loader,
gru = GRUModel()


slug = "deep-n-grams-training-the-model"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{slug}")

Plot = namedtuple("Plot", ["width", "height", "fontscale", "tan", "blue", "red"])
PLOT = Plot(


Some Jargon

An epoch is traditionally defined as one pass through the dataset.

Since the dataset was divided into batches you need several steps (gradient evaluations) in order to complete an epoch. So, one epoch corresponds to the number of examples in a batch times the number of steps. In short, in each epoch you go over all of the data.

The max_length variable defines the maximum length of the lines used to train our model; lines longer than that are discarded.

Below is a function that counts how many lines in the dataset meet that maximum-length criterion, followed by the number of steps needed to cover all of them, which corresponds to one epoch.

def lines_used(lines: list, max_length: int) -> int:
    """Counts the number of lines of max_length or shorter

    Args:
     lines: all lines of text as an array of lines
     max_length: maximum length of a line to use

    Returns:
     number of usable examples
    """
    return sum(1 for line in lines if len(line) <= max_length)

Let's see what we get.

useable = lines_used(, 32)
print(f"Number of used lines from the dataset: {useable:,}")
print(f"Batch size (a power of 2): {SETTINGS.batch_size}")
steps_per_epoch = int(useable/SETTINGS.batch_size)
print(f"Number of steps to cover one epoch: {steps_per_epoch}")

# our training sets aren't exactly the same for some reason.
# expect(useable).to(equal(25881))
# expect(steps_per_epoch).to(equal(808))
Number of used lines from the dataset: 25,781
Batch size (a power of 2): 32
Number of steps to cover one epoch: 805

It looks like the original notebook used os.listdir while I'm using Path.glob. Neither of them loads the files in alphabetical order, and for some reason they don't load them in the same order as each other either, so our data sets are the same length but the training and validation split created slightly different sets. Oh, well.
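If a reproducible split matters, one option (not what this post or the original notebook does) is to sort the paths before reading them, since neither os.listdir nor Path.glob promises an order. Here is a self-contained sketch using a temporary directory with made-up file names.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)
    for name in ("macbeth.txt", "hamlet.txt", "othello.txt"):
        (folder / name).write_text("placeholder\n")

    # sorting the paths makes the load order independent of the file system
    ordered = [path.name for path in sorted(folder.glob("*.txt"))]

print(ordered)
```

With a fixed file order, taking the last lines as the validation set gives the same lines on every machine.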

Training the Model

We'll implement the train_model program below to train the neural network we created in the previous post. Here is a list of things to do:

  • Create a trax.supervised.training.TrainTask object:
  • Create a trax.supervised.training.EvalTask object:
    • labeled_data = the labeled data that we want to evaluate on.
    • metrics = CrossEntropyLoss() and Accuracy()
    • How frequently we want to evaluate and checkpoint the model.
  • Create a trax.supervised.training.Loop object, this encapsulates the following:
    • The previously created TrainTask and EvalTask objects.
    • the training model
    • optionally the evaluation model, if different from the training model. NOTE: in presence of Dropout, etc. we usually want the evaluation model to behave slightly differently than the training model.

We will be using a cross entropy loss, with the Adam optimizer. See the trax documentation to get a better understanding. Make sure you use the number of steps provided as a parameter to train for the desired number of steps.

NOTE: Don't forget to wrap the data generator in itertools.cycle to iterate on it for multiple epochs.
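To illustrate the note about itertools.cycle, here is a small sketch with a made-up batches generator standing in for the real data generator. A plain generator stops after one pass through the data, while the cycled version keeps handing out batches for as many steps as the loop asks for.

```python
import itertools


def batches(lines: list, batch_size: int):
    """Yields consecutive batches once, then stops (a finite generator)."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


lines = ["a", "b", "c", "d"]

# without cycle this generator is exhausted after two batches
endless = itertools.cycle(batches(lines, batch_size=2))

# five steps need more than one pass through the data
steps = [next(endless) for _ in range(5)]
print(steps)
```

Note that itertools.cycle replays the batches it saw on the first pass, so every epoch sees them in the same order.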

def train_model(model: layers.Serial, data_generator: DataGenerator,
                batch_size: int=SETTINGS.batch_size,
                max_length: int=SETTINGS.max_length,
                eval_lines: list=loader.validation,
                n_steps: int=1, output_dir='model/') -> training.Loop: 
    """Function that trains the model

    Args:
      model: GRU model.
      data_generator: Data generator function.
      batch_size: Number of lines per batch.
      max_length: Maximum length allowed for a line to be processed.
      lines: List of lines to use for training. Defaults to lines.
      eval_lines: List of lines to use for evaluation.
      n_steps: Number of steps to train.
      output_dir: Relative path of directory to save model.

    Returns:
      Training loop for the model.
    """
    # this is the broken version for submission, I'll make a separate one for local running.

    bare_train_generator = data_generator(batch_size, max_length, lines,
    infinite_train_generator = itertools.cycle(bare_train_generator)

    bare_eval_generator = data_generator(batch_size, max_length,

    infinite_eval_generator = itertools.cycle(bare_eval_generator)

    # the notebook code is out of date so we need to have one for them and one for us... damnit
    # this first one is theirs
    train_task = training.TrainTask(
        loss_layer=tl.CrossEntropyLoss(),   # Don't forget to instantiate this object
        optimizer=trax.optimizers.Adam(learning_rate=0.0005)     # Don't forget to add the learning rate parameter

    eval_task = training.EvalTask(
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()], # Don't forget to instantiate these objects
        n_eval_batches=3      # For better evaluation accuracy in reasonable time

    training_loop = training.Loop(model,

    # We return this because it contains a handle to the model, which has the weights etc.
    return training_loop
training_loop = train_model(GRULM(), data_generator)

The model was only trained for 1 step due to the constraints of this environment. Even on a GPU accelerated environment it will take many hours for it to achieve a good level of accuracy. For the rest of the assignment you will be using a pretrained model but now you should understand how the training can be done using Trax.

Take Two

def take_two(model: layers.Serial,
             training: DataGenerator,
             evaluation: DataGenerator,
             learning_rate: float=SETTINGS.learning_rate,
             batches: int=1,
             evaluation_batches: int=3,
             steps_per_checkpoint: int=1000,
             output_dir=SETTINGS.output) -> trax_training.Loop: 
    """Function that trains the model

    Args:
      model: GRU model.
      training: cycling data generator for training
      evaluation: cycling data generator for evaluation
      learning_rate: alpha for the optimizer
      batches: Number of batches to train.
      evaluation_batches: number of evaluation batches to run
      steps_per_checkpoint: how often to stop and evaluate the model
      output_dir: Relative path of directory to save model.

    Returns:
      Training loop for the model.
    """
    train_task = trax_training.TrainTask(

    eval_task = trax_training.EvalTask(

    training_loop = trax_training.Loop(model,
    start =
    print(f"Elapsed: { - start}")
    return training_loop
loop = take_two(gru.model, training_generator, evaluation, batches=1000)

It looks like it's stuck.

Plotting Accuracy

frame = pandas.DataFrame(loop.history.get("eval", "metrics/Accuracy"),
                         columns="Batch Accuracy".split())
maximum = frame.loc[frame.Accuracy.idxmax()]
vline = holoviews.VLine(maximum.Batch).opts(opts.VLine(
hline = holoviews.HLine(maximum.Accuracy).opts(opts.HLine(
line = frame.hvplot(x="Batch", y="Accuracy").opts(opts.Curve(

plot = (line * hline * vline).opts(
                                   width=PLOT.width, height=PLOT.height, title="Evaluation Batch Accuracy",
output = Embed(plot=plot, file_name="evaluation_accuracy")()

Plotting Loss

frame = pandas.DataFrame(loop.history.get("eval", "metrics/WeightedCategoryCrossEntropy")
                         , columns="Batch Loss".split())
minimum = frame.loc[frame.Loss.idxmin()]
vline = holoviews.VLine(minimum.Batch).opts(opts.VLine(
hline = holoviews.HLine(minimum.Loss).opts(opts.HLine(
line = frame.hvplot(x="Batch", y="Loss").opts(opts.Curve(

plot = (line * hline * vline).opts(
                                   width=PLOT.width, height=PLOT.height, title="Evaluation Batch Cross Entropy",
output = Embed(plot=plot, file_name="evaluation_cross_entropy")()

Well, it looks like it's getting worse, not better. I'm probably overfitting. I guess this model isn't good enough to do better.

Deep N-Grams: Creating the Model

Defining the GRU Model

We're going to build a GRU model using trax. We'll do this by passing in "layers" to the Serial class:

  • Serial: Class that applies layers serially (by function composition).
    • You can pass in the layers as arguments to Serial, separated by commas.
    • For example: Serial(Embeddings(...), Mean(...), Dense(...), LogSoftmax(...))

These are the layers that we'll be using:

  • ShiftRight: A layer that adds padding to shift the input. (note that this is one of the Trax methods that has re-named the arguments)
    • ShiftRight(n_positions=1, mode='train'): layer to shift the tensor to the right n_positions times
    • Here in the exercise you only need to specify the mode and not worry about n_positions
  • Embedding: Initializes the embedding layer which maps tokens/IDs to vectors
    • Embedding(vocab_size, d_feature). In this case it is the size of the vocabulary by the dimension of the model.
    • vocab_size is the number of unique words in the given vocabulary.
    • d_feature is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
  • GRU: The Trax GRU layer.
  • Dense: A dense (fully-connected) layer.
    • Dense(n_units): The parameter n_units is the number of units chosen for this dense layer.
  • LogSoftmax: Log of the output probabilities.
    • Here, you don't need to set any parameters for LogSoftmax().
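Two of the layers in the list are easy to picture without Trax. The sketch below is plain Python with made-up function names, not the Trax implementation: shift_right mimics what ShiftRight does to a single sequence in 'train' mode as I understand it (pad the front, drop the end), and log_softmax shows what LogSoftmax computes for one vector of activations.

```python
import math


def shift_right(tokens: list, n_positions: int = 1, pad: int = 0) -> list:
    """Pads the front and drops the end so the length is unchanged."""
    return ([pad] * n_positions + tokens)[:len(tokens)]


def log_softmax(activations: list) -> list:
    """Turns raw activations into log probabilities."""
    largest = max(activations)  # subtracted for numerical stability
    log_total = largest + math.log(
        sum(math.exp(value - largest) for value in activations))
    return [value - log_total for value in activations]


shifted = shift_right([97, 98, 99])
log_probabilities = log_softmax([2.0, 1.0, 0.1])

print(shifted)
print(sum(math.exp(value) for value in log_probabilities))
```

Shifting the input means that at each position the model only sees the characters that came before the one it has to predict.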


# pypi
from trax import layers


The GRU Model

def GRULM(vocab_size: int=256, d_model: int=512, n_layers: int=2, mode:str='train') -> layers.Serial:
    """Returns a GRU language model.

    Args:
       vocab_size (int, optional): Size of the vocabulary. Defaults to 256.
       d_model (int, optional): Depth of embedding (n_units in the GRU cell). Defaults to 512.
       n_layers (int, optional): Number of GRU layers. Defaults to 2.
       mode (str, optional): 'train', 'eval' or 'predict', predict mode is for fast inference. Defaults to "train".

    Returns:
       trax.layers.combinators.Serial: A GRU language model as a layer that maps from a tensor of tokens to activations over a vocab set.
    """
    model = layers.Serial(
        # the ``n_shifts`` argument seems to have been renamed ``n_positions``,
        # so pass it positionally to remain backwards compatible
        layers.ShiftRight(1, mode=mode),
        layers.Embedding(vocab_size, d_model),
        *[layers.GRU(d_model) for unit in range(n_layers)],
        # one output per vocabulary entry, converted to log probabilities
        layers.Dense(vocab_size),
        layers.LogSoftmax(),
    )
    return model

Will It Build?

model = GRULM()

Saving it for Later

It seems a little goofy to do this, but since I might forget some of the values, might as well.


# from pypi
from trax import layers

import attr

Model Builder

class GRUModel:
    """Builds the layers for the GRU model

    Args:
     shift_positions: amount of padding to add to the front of input
     vocabulary_size: the size of our learned vocabulary
     model_dimensions: the GRU and Embeddings dimensions
     gru_layers: how many GRU layers to create
     mode: train, eval, or predict
    """
    shift_positions: int=1
    vocabulary_size: int=256
    model_dimensions: int=512
    gru_layers: int=2
    mode: str="train"
    _model: layers.Serial=None
  • The Model
    def model(self) -> layers.Serial:
        """The GRU Model"""
        if self._model is None:
            self._model = layers.Serial(
                layers.ShiftRight(self.shift_positions, mode=self.mode),
                layers.Embedding(self.vocabulary_size, self.model_dimensions),
                *[layers.GRU(self.model_dimensions)
                  for gru_layer in range(self.gru_layers)],
                layers.Dense(self.vocabulary_size),
                layers.LogSoftmax(),
            )
        return self._model

Check It Out

from neurotic.nlp.deep_rnn import GRUModel

gru = GRUModel()

Deep N-Grams: Loading the Data

Text to Tensor

In this section we're going to load the text data and transform it into tensors.


# python
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from expects import (be_true,

Set Up

The path to the data is kept in a .env file so we'll load it into the environment here.

load_dotenv("posts/nlp/.env", override=True)
data_path = Path(os.environ["SHAKESPEARE"]).expanduser()


Loading the Data

We're going to be using the plays of Shakespeare. Unlike previously, this data source has them in separate files so we'll have to load each one separately. We're going to be generating characters, not words, so each character has to be given an integer ID. We'll use the Unicode values given to us by the built-in ord function.

lines = []
for filename in data_path.glob("*.txt"):
    with as play:
        cleaned = (line.strip() for line in play)
        lines += [line for line in cleaned if line]

This only strips the leading and trailing whitespace; other things, like tabs inside the lines, are still in there.

line_count = len(lines)
print(f"Number of lines: {line_count:,}")
print(f"Sample line at position 0: {lines[0]}")
print(f"Sample line at position 999: {lines[999]}")
Number of lines: 125,097
Sample line at position 0: king john
Sample line at position 999: as it makes harmful all that speak of it.

To make this a little easier, we'll convert all characters to lowercase. This way, for example, the model only needs to predict the likelihood that a letter is 'a' and not decide between uppercase 'A' and lowercase 'a'.

lines = [line.lower() for line in lines]

new_line_count = len(lines)
print(f"Number of lines: {new_line_count:,}")
print(f"Sample line at position 0: {lines[0]}")
print(f"Sample line at position 999: {lines[999]}")
Number of lines: 125,097
Sample line at position 0: king john
Sample line at position 999: as it makes harmful all that speak of it.

Once again, we're going to do a straight split to create the training and validation data instead of using randomization.

SPLIT = 1000
validation = lines[-SPLIT:]
training = lines[:-SPLIT]

print(f"Number of lines for training: {len(training):,}")
print(f"Number of lines for validation: {len(validation):,}")
Number of lines for training: 124,097
Number of lines for validation: 1,000

To Tensors

As I mentioned before, we're going to use Python's ord function to convert the letters to integers.

for character in "abc xyz123":
    print(f"{character}: {ord(character)}")
a: 97
b: 98
c: 99
 : 32
x: 120
y: 121
z: 122
1: 49
2: 50
3: 51
def line_to_tensor(line: str, EOS_int: int=1) -> list:
    """Turns a line of text into a tensor

    Args:
     line: A single line of text.
     EOS_int: End-of-sentence integer. Defaults to 1.

    Returns:
     a list of integers (unicode values) for the characters in the ``line``.
    """
    tensor = []
    # for each character:
    for c in line:

        # convert to unicode int
        c_int = ord(c)

        # append the unicode integer to the tensor list
        tensor.append(c_int)

    # include the end-of-sentence integer
    tensor.append(EOS_int)
    return tensor

Test the Output

actual = line_to_tensor('abc xyz')
expected = [97, 98, 99, 32, 120, 121, 122, 1]
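Going the other way is just as short. This helper is not part of the original post; tensor_to_line is a hypothetical name, and it assumes the same end-of-sentence integer of 1 that line_to_tensor uses by default.

```python
def tensor_to_line(tensor: list, EOS_int: int = 1) -> str:
    """Converts a list of unicode integers back to text, stopping at EOS."""
    characters = []
    for c_int in tensor:
        if c_int == EOS_int:
            break  # everything after the end-of-sentence marker is ignored
        characters.append(chr(c_int))
    return "".join(characters)


decoded = tensor_to_line([97, 98, 99, 32, 120, 121, 122, 1])
print(decoded)
```

This is handy for reading the model's output, since predictions come back as integers too.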


Bundle It Up

This is going to be needed in future posts so I'm going to put it in a class.


# python
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv

import attr

The Data Loader

class DataLoader:
    """Load the data and convert it to 'tensors'

    Args:
     env_path: the path to the env file (as a string)
     env_key: the environmental variable with the path to the data
     validation_size: number for the validation set
     end_of_sentence: integer to use to indicate the end of a sentence
    """
    env_path: str="posts/nlp/.env"
    env_key: str="SHAKESPEARE"
    validation_size: int=1000
    end_of_sentence: int=1
    _data_path: Path=None
    _lines: list=None
    _training: list=None
    _validation: list=None

The Data Path

def data_path(self) -> Path:
    """Loads the dotenv and converts the path

    Raises:
     assertion error if path doesn't exist
    """
    if self._data_path is None:
        load_dotenv(self.env_path, override=True)
        self._data_path = Path(os.environ[self.env_key]).expanduser()
        assert self.data_path.is_dir()
    return self._data_path

The Lines

def lines(self) -> list:
    """The lines of text-data"""
    if self._lines is None:
        self._lines = []
        for filename in self.data_path.glob("*.txt"):
            with as play:
                cleaned = (line.strip() for line in play)
                self._lines += [line.lower() for line in cleaned if line]
    return self._lines

The Training Set

def training(self) -> list:
    """Subset of the lines for training"""
    if self._training is None:
        self._training = self.lines[:-self.validation_size]
    return self._training

The Validation Set

def validation(self) -> list:
    """The validation subset of the lines"""
    if self._validation is None:
        self._validation = self.lines[-self.validation_size:]
    return self._validation

To Tensor

def to_tensor(self, line: str) -> list:
    """Converts the line to the unicode value

    Args:
     line: the text to convert

    Returns:
     line converted to unicode integer encodings
    """
    return [ord(character) for character in line] + [self.end_of_sentence]
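The accessors above are used like attributes elsewhere (self.lines[...], loader.validation), which suggests that they are meant to be properties that build their value on first use and then cache it. Here is a minimal stand-alone sketch of that pattern, assuming @property decorators that the listings don't show. TinyLoader is a hypothetical stand-in that takes its lines directly instead of reading files.

```python
class TinyLoader:
    """Hypothetical stand-in for the DataLoader, using in-memory lines."""

    def __init__(self, lines: list, validation_size: int = 1):
        self.lines = lines
        self.validation_size = validation_size
        self._training = None
        self._validation = None

    @property
    def training(self) -> list:
        """Everything except the last ``validation_size`` lines"""
        if self._training is None:
            self._training = self.lines[:-self.validation_size]
        return self._training

    @property
    def validation(self) -> list:
        """The last ``validation_size`` lines"""
        if self._validation is None:
            self._validation = self.lines[-self.validation_size:]
        return self._validation


tiny = TinyLoader(["king john", "dramatis personae", "king john:"])
print(tiny.training)
print(tiny.validation)
```

Because each value is built once and then kept, asking for the training set repeatedly doesn't redo the slicing (or, in the real loader, the file reading).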

Check the Data Loader

from neurotic.nlp.deep_rnn.data_loader import DataLoader

loader = DataLoader()

expect(len( - SPLIT))

actual = loader.to_tensor('abc xyz')
expected = [97, 98, 99, 32, 120, 121, 122, 1]

for line in loader.lines[:10]:
king john
dramatis personae
king john:
prince henry    son to the king.
arthur  duke of bretagne, nephew to the king.
the earl of
pembroke        (pembroke:)
the earl of essex       (essex:)
the earl of
salisbury       (salisbury:)

Deep N-Grams

Deep N-Grams

This is an exploration of Recurrent Neural Networks (RNN) using trax. We're going to predict the next set of characters in a sentence given the previous characters.

Since this is so long I'm going to break it up into separate posts.

First up: Loading the Data.