Siamese Networks: The Data

Transforming the Data

We'll be using the Quora question answer dataset to build a model that can identify similar questions. This is a useful task because you don't want several versions of the same question posted. When teaching, I often end up responding to similar questions on Piazza or on other community forums. This dataset has been labeled for you. Run the cell below to import some of the packages you will be using.

Imports

# python
from collections import defaultdict
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from expects import expect, contain_exactly

import nltk
import numpy
import pandas

# my other stuff
from graeae import Timer

Set Up

The Timer

TIMER = Timer()

NLTK

We need to download the punkt data to be able to tokenize our sentences.

nltk.download("punkt")
[nltk_data] Downloading package punkt to
[nltk_data]     /home/neurotic/data/datasets/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

The Training Data

load_dotenv("posts/nlp/.env")
path = Path(os.environ["QUORA_TRAIN"]).expanduser()
data = pandas.read_csv(path)

Middle

Inspecting the Data

rows, columns = data.shape
print(f"Rows: {rows:,} Columns: {columns}")
Rows: 404,290 Columns: 6
print(data.iloc[0])
id                                                              0
qid1                                                            1
qid2                                                            2
question1       What is the step by step guide to invest in sh...
question2       What is the step by step guide to invest in sh...
is_duplicate                                                    0
Name: 0, dtype: object

So, you can see that we have a row ID, followed by IDs for each of the questions, followed by the question-pair, and finally a label of whether the two questions are duplicates (1) or not (0).

Train Test Split

For the moment we're going to use a straight splitting of the dataset, rather than using a shuffled split. We're going for a roughly 75-25 split.

training_size = 3 * 10**5
training_data = data.iloc[:training_size]
testing_data = data.iloc[training_size:]

assert len(training_data) == training_size

Since the data set is large, we'll delete the original pandas DataFrame to save memory.

del data

Filtering Out Non-Duplicates

We are going to use only the question pairs that are duplicate to train the model.

We build two batches as input for the Siamese network and we assume that question \(q1_i\) (question i in the first batch) is a duplicate of \(q2_i\) (question i in the second batch), but all other questions in the second batch are not duplicates of \(q1_i\).

The test set uses the original pairs of questions and the status describing if the questions are duplicates.

duplicates = training_data[training_data.is_duplicate==1]
example = duplicates.iloc[0]
print(example.question1)
print(example.question2)
print(example.is_duplicate)
Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?
1
print(f"There are {len(duplicates):,} duplicates for the training data.")
There are 111,473 duplicates for the training data.

We only took the duplicated questions for training our model because the data generator will produce batches \(([q1_1, q1_2, q1_3, \ldots], [q2_1, q2_2, q2_3, \ldots])\) where \(q1_i\) and \(q2_k\) are duplicates if and only if \(i = k\).
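As a rough illustration of that batching assumption, the sketch below builds the label matrix implied by two aligned batches. The toy question strings and the names `batch_1`, `batch_2`, and `implied_labels` are made up for this example and are not part of the actual data generator.

```python
import numpy

# Hypothetical toy batches: position i in each list holds a duplicate pair.
batch_1 = ["How do I learn Python?",
           "What is a Siamese network?",
           "Is Pluto a planet?"]
batch_2 = ["What is the best way to learn Python?",
           "How do Siamese networks work?",
           "Why is Pluto not a planet?"]

# Entry [i, k] is 1 when q1_i and q2_k are treated as duplicates,
# which under the batching assumption only happens when i == k.
implied_labels = numpy.identity(len(batch_1), dtype=int)

for i, row in enumerate(implied_labels):
    duplicate_of = batch_2[int(numpy.argmax(row))]
    print(f"{batch_1[i]!r} pairs with {duplicate_of!r}")
```

Every off-diagonal pairing in a batch is treated as a non-duplicate, which is why the explicitly labeled non-duplicate rows can be left out of the training set.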

Encoding the Words

Now we'll encode each word of the selected duplicate pairs with an index. Given a question, we can then just encode it as a list of numbers.

First we'll tokenize the questions using nltk.word_tokenize.

We'll also need a python default dictionary which later, during inference, assigns the value 0 to all Out Of Vocabulary (OOV) words.

Build the Vocabulary

We'll start by resetting the index. Pandas preserves the original index, but since we dropped the non-duplicates it's missing rows so resetting it will start it at 0 again. By default it normally keeps the original index as a column, but passing in drop=True prevents that.

reindexed = duplicates.reset_index(drop=True)

Now we'll build the vocabulary by mapping the words to the "index" for that word in the dictionary.

vocabulary = defaultdict(lambda: 0)
vocabulary['<PAD>'] = 1

with TIMER:
    question_1_train = duplicates.question1.apply(nltk.word_tokenize)
    question_2_train = duplicates.question2.apply(nltk.word_tokenize)
    combined = question_1_train + question_2_train
    for index, tokens in combined.items():
        tokens = (token for token in set(tokens) if token not in vocabulary)
        for token in tokens:
            vocabulary[token] = len(vocabulary) + 1
print(f"There are {len(vocabulary):,} words in the vocabulary.")            
Started: 2021-01-30 18:36:26.773827
Ended: 2021-01-30 18:36:46.522680
Elapsed: 0:00:19.748853
There are 36,278 words in the vocabulary.

Some example vocabulary words.

print(vocabulary['<PAD>'])
print(vocabulary['Astrology'])
print(vocabulary['Astronomy'])
1
7
0

The last 0 indicates that, while Astrology is in our vocabulary, Astronomy is not. Peculiar.

Now we'll set up the test arrays. One of the Question 1 entries is empty so we'll have to drop it first.

testing_data = testing_data[~testing_data.question1.isna()]
with TIMER:
    Q1_test_words = testing_data.question1.apply(nltk.word_tokenize)
    Q2_test_words = testing_data.question2.apply(nltk.word_tokenize)
Started: 2021-01-30 16:43:08.891230
Ended: 2021-01-30 16:43:27.954422
Elapsed: 0:00:19.063192

Converting a question to a tensor

We'll now convert every question to a tensor, or an array of numbers, using the vocabulary we built above.

def words_to_index(words):
    return [vocabulary[word] for word in words]

Q1_train = question_1_train.apply(words_to_index)
Q2_train = question_2_train.apply(words_to_index)

Q1_test = Q1_test_words.apply(words_to_index)
Q2_test = Q2_test_words.apply(words_to_index)

print('first question in the train set:\n')
print(question_1_train.iloc[0], '\n') 
print('encoded version:')
print(Q1_train.iloc[0],'\n')
first question in the train set:

['Astrology', ':', 'I', 'am', 'a', 'Capricorn', 'Sun', 'Cap', 'moon', 'and', 'cap', 'rising', '...', 'what', 'does', 'that', 'say', 'about', 'me', '?'] 

encoded version:
[7, 6, 17, 26, 22, 12, 15, 14, 2, 24, 16, 19, 31, 8, 9, 21, 25, 3, 23, 29] 

print(f"{len(vocabulary):,}")
77,068
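The vocabulary size printed here (77,068) is larger than the 36,278 reported right after the vocabulary was built. One likely explanation (an assumption, not something verified above) is that indexing a defaultdict with a missing key inserts that key, so encoding the test questions adds every out-of-vocabulary word with the value 0. The toy dictionary below is hypothetical and only demonstrates the mechanism.

```python
from collections import defaultdict

# Hypothetical miniature vocabulary, built the same way as the real one.
toy_vocabulary = defaultdict(lambda: 0)
toy_vocabulary["<PAD>"] = 1
toy_vocabulary["Astrology"] = 2
size_before = len(toy_vocabulary)

# Indexing with [] on a missing key stores the default value as a side effect.
oov_index = toy_vocabulary["Astronomy"]
size_after_lookup = len(toy_vocabulary)

# .get() returns the fallback without modifying the dictionary.
safe_index = toy_vocabulary.get("Telescope", 0)
size_after_get = len(toy_vocabulary)

print(size_before, size_after_lookup, size_after_get)
```

If the vocabulary size is needed later (for example, to size an embedding layer), it may be safer to record it before encoding any unseen text, or to look words up with `.get(word, 0)`.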

Validation Set

You will now split your train set into a training/validation set so that you can use it to train and evaluate your Siamese model.

TRAINING_FRACTION = 0.8
cut_off = int(len(question_1_train) * TRAINING_FRACTION)
train_question_1, train_question_2 = Q1_train[:cut_off], Q2_train[:cut_off]
validation_question_1, validation_question_2 = Q1_train[cut_off: ], Q2_train[cut_off:]
print(f"Number of duplicate questions: {len(Q1_train):,}")
print(f"The length of the training set is:  {len(train_question_1):,}")
print(f"The length of the validation set is: {len(validation_question_1):,}")
Number of duplicate questions: 111,473
The length of the training set is:  89,178
The length of the validation set is: 22,295

Bundling It Up

Imports

# python
from collections import defaultdict, namedtuple
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv

import attr
import nltk
import pandas

NLTK Setup

nltk.download("punkt")

Constants and Data

Tokens = namedtuple("Tokens", ["unknown", "padding", "padding_token"])
TOKENS = Tokens(unknown=0,
                padding=1,
                padding_token="<PAD>")

Question = namedtuple("Question", ["question_one", "question_two"])
Data = namedtuple("Data", ["train", "validate", "test", "y_test"])

The Data Tokenizer

@attr.s(auto_attribs=True)
class DataTokenizer:
    """Converts questions to tokens

    Args:
     data: the data-frame to tokenize
    """
    data: pandas.DataFrame
    _question_1: pandas.Series=None
    _question_2: pandas.Series=None

Question 1

@property
def question_1(self) -> pandas.Series:
    """tokenized version of question 1"""
    if self._question_1 is None:
        self._question_1 = self.data.question1.apply(nltk.word_tokenize)
    return self._question_1

Question 2

@property
def question_2(self) -> pandas.Series:
    """tokenized version of question 2"""
    if self._question_2 is None:
        self._question_2 = self.data.question2.apply(nltk.word_tokenize)
    return self._question_2

The Data Tensorizer

@attr.s(auto_attribs=True)
class DataTensorizer:
    """Convert tokenized words to numbers

    Args:
     vocabulary: word to integer mapping
     question_1: data to convert
     question_2: other data to convert
    """
    vocabulary: dict
    question_1: pandas.Series
    question_2: pandas.Series
    _tensorized_1: pandas.Series=None
    _tensorized_2: pandas.Series=None

Tensorized 1

@property
def tensorized_1(self) -> pandas.Series:
    """numeric version of question 1"""
    if self._tensorized_1 is None:
        self._tensorized_1 = self.question_1.apply(self.to_index)
    return self._tensorized_1

Tensorized 2

@property
def tensorized_2(self) -> pandas.Series:
    """Numeric version of question 2"""
    if self._tensorized_2 is None:
        self._tensorized_2 = self.question_2.apply(self.to_index)
    return self._tensorized_2

To Index

def to_index(self, words: list) -> list:
    """Convert list of words to list of integers"""
    return [self.vocabulary[word] for word in words]

The Data Loader

@attr.s(auto_attribs=True)
class DataLoader:
    """Loads and transforms the data

    Args:
     env: The path to the .env file with the raw-data path
     key: key in the environment with the path to the data
     train_validation_size: number of entries for the training/validation set
     training_fraction: what fraction of the training/validation set to use for training
    """
    env: str="posts/nlp/.env"
    key: str="QUORA_TRAIN"
    train_validation_size: int=300000
    training_fraction: float=0.8
    _data_path: Path=None
    _raw_data: pandas.DataFrame=None
    _training_data: pandas.DataFrame=None
    _testing_data: pandas.DataFrame=None
    _duplicates: pandas.DataFrame=None
    _tokenized_train: DataTokenizer=None
    _tokenized_test: DataTokenizer=None
    _vocabulary: dict=None
    _tensorized_train: DataTensorizer=None
    _tensorized_test: DataTensorizer=None
    _test_labels: pandas.Series=None    
    _data: namedtuple=None

Data Path

@property
def data_path(self) -> Path:
    """Where to find the data file"""
    if self._data_path is None:
        load_dotenv(self.env)
        self._data_path = Path(os.environ[self.key]).expanduser()
    return self._data_path

Data

@property
def raw_data(self) -> pandas.DataFrame:
    """The raw-data"""
    if self._raw_data is None:
        self._raw_data = pandas.read_csv(self.data_path)
        self._raw_data = self._raw_data[~self._raw_data.question1.isna()]
        self._raw_data = self._raw_data[~self._raw_data.question2.isna()]        
    return self._raw_data

Training Data

@property
def training_data(self) -> pandas.DataFrame:
    """The training/validation part of the data"""
    if self._training_data is None:
        self._training_data = self.raw_data.iloc[:self.train_validation_size]
    return self._training_data

Testing Data

@property
def testing_data(self) -> pandas.DataFrame:
    """The testing portion of the raw data"""
    if self._testing_data is None:
        self._testing_data = self.raw_data.iloc[self.train_validation_size:]
    return self._testing_data

Duplicates

@property
def duplicates(self) -> pandas.DataFrame:
    """training-validation data that has duplicate questions"""
    if self._duplicates is None:
        self._duplicates = self.training_data[self.training_data.is_duplicate==1]
    return self._duplicates

Train Tokenizer

@property
def tokenized_train(self) -> DataTokenizer:
    """training tokenized    
    """
    if self._tokenized_train is None:
        self._tokenized_train = DataTokenizer(self.duplicates)
    return self._tokenized_train

Test Tokenizer

@property
def tokenized_test(self) -> DataTokenizer:
    """Test Tokenizer"""
    if self._tokenized_test is None:
        self._tokenized_test = DataTokenizer(
            self.testing_data)
    return self._tokenized_test

The Vocabulary

@property
def vocabulary(self) -> dict:
    """The token:index map"""
    if self._vocabulary is None:
        self._vocabulary = defaultdict(lambda: TOKENS.unknown)
        self._vocabulary[TOKENS.padding_token] = TOKENS.padding
        combined = (self.tokenized_train.question_1
                    + self.tokenized_train.question_2)
        for index, tokens in combined.items():
            tokens = (token for token in set(tokens)
                      if token not in self._vocabulary)
            for token in tokens:
                self._vocabulary[token] = len(self._vocabulary) + 1
    return self._vocabulary            

Tensorized Train

@property
def tensorized_train(self) -> DataTensorizer:
    """Tensorizer for the training data"""
    if self._tensorized_train is None:
        self._tensorized_train = DataTensorizer(
            vocabulary=self.vocabulary,
            question_1 = self.tokenized_train.question_1,
            question_2 = self.tokenized_train.question_2,
        )
    return self._tensorized_train

Tensorized Test

@property
def tensorized_test(self) -> DataTensorizer:
    """Tensorizer for the testing data"""
    if self._tensorized_test is None:
        self._tensorized_test = DataTensorizer(
            vocabulary = self.vocabulary,
            question_1 = self.tokenized_test.question_1,
            question_2 = self.tokenized_test.question_2,
        )
    return self._tensorized_test

Test Labels

@property
def test_labels(self) -> pandas.Series:
    """The labels for the test data

    0 : not duplicate questions
    1 : is duplicate
    """
    if self._test_labels is None:
        self._test_labels = self.testing_data.is_duplicate
    return self._test_labels

The Final Data

@property
def data(self) -> namedtuple:
    """The final tensorized data"""
    if self._data is None:
        cut_off = int(len(self.duplicates) * self.training_fraction)
        self._data = Data(
            train=Question(
                question_one=self.tensorized_train.tensorized_1[:cut_off].to_numpy(),
                question_two=self.tensorized_train.tensorized_2[:cut_off].to_numpy()),
            validate=Question(
                question_one=self.tensorized_train.tensorized_1[cut_off:].to_numpy(),
                question_two=self.tensorized_train.tensorized_2[cut_off:].to_numpy()),
            test=Question(
                question_one=self.tensorized_test.tensorized_1.to_numpy(),
                question_two=self.tensorized_test.tensorized_2.to_numpy()),
            y_test=self.test_labels.to_numpy(),
        )
    return self._data

Test It Out

from neurotic.nlp.siamese_networks import DataLoader

loader = DataLoader()

data = loader.data
print(f"Number of duplicate questions: {len(loader.duplicates):,}")
print(f"The length of the training set is:  {len(data.train.question_one):,}")
print(f"The length of the validation set is: {len(data.validate.question_one):,}")
Number of duplicate questions: 111,474
The length of the training set is:  89,179
The length of the validation set is: 22,295
print('first question in the train set:\n')
print(loader.duplicates.question1.iloc[0])
print('encoded version:')
print(data.train.question_one[0],'\n')
expect(data.train.question_one[0]).to(contain_exactly(*Q1_train.iloc[0]))
first question in the train set:

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
encoded version:
[7, 6, 17, 26, 22, 12, 15, 14, 2, 24, 16, 19, 31, 8, 9, 21, 25, 3, 23, 29] 

assert len(loader.vocabulary) == len(vocabulary)
assert not set(vocabulary) - set(loader.vocabulary)
print(f"{len(loader.vocabulary):,}")
77,068

Siamese Networks: Duplicate Questions

Beginning

In this series of posts we will:

  • Learn about Siamese networks
  • Understand how the triplet loss works
  • Understand how to evaluate accuracy
  • Use cosine similarity between the model's output vectors
  • Use the data generator to get batches of questions
  • Make predictions using your own model

Evaluating a Siamese Model

Beginning

We are going to learn how to evaluate a Siamese model using the accuracy metric.

Imports

# python
from pathlib import Path
import os

# from pypi
from dotenv import load_dotenv

import trax.fastmath.numpy as trax_numpy

Set Up

load_dotenv("posts/nlp/.env")
PREFIX = "SIAMESE_"
q1 = trax_numpy.load(Path(os.environ[PREFIX + "Q1"]).expanduser())
q2 = trax_numpy.load(Path(os.environ[PREFIX + "Q2"]).expanduser())
v1 = trax_numpy.load(Path(os.environ[PREFIX + "V1"]).expanduser())
v2 = trax_numpy.load(Path(os.environ[PREFIX + "V2"]).expanduser())
y_test = trax_numpy.load(Path(os.environ[PREFIX + "Y_TEST"]).expanduser())

Middle

Data

We're going to use some pre-made data rather than start from scratch to (hopefully) make the actual evaluation clearer.

These are the data structures:

  • q1: vector with dimension (batch_size X max_length) containing first questions to compare in the test set.
  • q2: vector with dimension (batch_size X max_length) containing second questions to compare in the test set.

Notice that for each pair of vectors within a batch \(([q1_1, q1_2, q1_3, \ldots], [q2_1, q2_2, q2_3, \ldots])\), \(q1_i\) is associated with \(q2_i\).

  • y_test: 1 if \(q1_i\) and \(q2_i\) are duplicates, 0 otherwise.
  • v1: output vector from the model's prediction associated with the first questions.
  • v2: output vector from the model's prediction associated with the second questions.
print(f'q1 has shape: {q1.shape} \n\nAnd it looks like this: \n\n {q1}\n\n')
q1 has shape: (512, 64) 

And it looks like this: 

 [[ 32  38   4 ...   1   1   1]
 [ 30 156  78 ...   1   1   1]
 [ 32  38   4 ...   1   1   1]
 ...
 [ 32  33   4 ...   1   1   1]
 [ 30 156 317 ...   1   1   1]
 [ 30 156   6 ...   1   1   1]]

The ones on the right side are padding values.
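How the pre-made arrays were padded isn't shown here, so the following is only a minimal sketch of right-padding a list of token indices. The padding index of 1 matches the '<PAD>' entry in the vocabulary; the function name `pad_question`, the shortened `MAX_LENGTH`, the token 107, and the choice to truncate long questions are all assumptions made for the example.

```python
PADDING_INDEX = 1   # the index assigned to '<PAD>' in the vocabulary
MAX_LENGTH = 8      # the arrays above use 64; shortened here for readability


def pad_question(encoded, max_length=MAX_LENGTH, padding=PADDING_INDEX):
    """Right-pad (or truncate) a list of token indices to max_length"""
    trimmed = encoded[:max_length]
    return trimmed + [padding] * (max_length - len(trimmed))


# A made-up encoded question with four tokens.
padded = pad_question([32, 38, 4, 107])
print(padded)
```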

print(f'q2 has shape: {q2.shape} \n\nAnd looks like this: \n\n {q2}\n\n')
q2 has shape: (512, 64) 

And looks like this: 

 [[   30   156    78 ...     1     1     1]
 [  283   156    78 ...     1     1     1]
 [   32    38     4 ...     1     1     1]
 ...
 [   32    33     4 ...     1     1     1]
 [   30   156    78 ...     1     1     1]
 [   30   156 10596 ...     1     1     1]]
print(f'y_test has shape: {y_test.shape} \n\nAnd looks like this: \n\n {y_test}\n\n')
y_test has shape: (512,) 

And looks like this: 

 [0 1 1 0 0 0 0 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0
 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0
 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 1 0 1 1 0 0
 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0
 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 1 1 1
 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1
 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1
 1 0 1 1 0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 1 1 1 0 0
 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0
 0 0 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0
 1 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 1
 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1
 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
print(f'v1 has shape: {v1.shape} \n\nAnd looks like this: \n\n {v1}\n\n')
v1 has shape: (512, 128) 

And looks like this: 

 [[ 0.01273625 -0.1496373  -0.01982759 ...  0.02205012 -0.00169148
  -0.01598107]
 [-0.05592084  0.05792497 -0.02226785 ...  0.08156938 -0.02570007
  -0.00503111]
 [ 0.05686752  0.0294889   0.04522024 ...  0.03141788 -0.08459651
  -0.00968536]
 ...
 [ 0.15115018  0.17791134  0.02200656 ... -0.00851707  0.00571415
  -0.00431194]
 [ 0.06995274  0.13110274  0.0202337  ... -0.00902792 -0.01221745
   0.00505962]
 [-0.16043712 -0.11899089 -0.15950686 ...  0.06544471 -0.01208312
  -0.01183368]]
print(f'v2 has shape: {v2.shape} \n\nAnd looks like this: \n\n {v2}\n\n')
v2 has shape: (512, 128) 

And looks like this: 

 [[ 0.07437647  0.02804951 -0.02974014 ...  0.02378932 -0.01696189
  -0.01897198]
 [ 0.03270066  0.15122835 -0.02175895 ...  0.00517202 -0.14617395
   0.00204823]
 [ 0.05635608  0.05454165  0.042222   ...  0.03831453 -0.05387777
  -0.01447786]
 ...
 [ 0.04727105 -0.06748016  0.04194937 ...  0.07600753 -0.03072828
   0.00400715]
 [ 0.00269269  0.15222628  0.01714724 ...  0.01482705 -0.0197884
   0.01389528]
 [-0.15475044 -0.15718803 -0.14732707 ...  0.04299919 -0.01070975
  -0.01318042]]

Calculating the accuracy

You will calculate the accuracy by iterating over the test set and checking if the model predicts right or wrong.

You will also need the batch size and the threshold that will determine if two questions are the same or not.

Note: A higher threshold means that only very similar questions will be considered as the same question.

batch_size = 512
threshold = 0.7
batch = range(batch_size)

The process is pretty straightforward:

  • Iterate over each one of the elements in the batch
  • Compute the cosine similarity between the predictions
    • To compute the cosine similarity this way, the two output vectors must have been L2-normalized so that their magnitude is 1. The Siamese network has already taken care of this, so the cosine similarity here is just the dot product of the two vectors. You can check this by implementing the usual cosine similarity formula and comparing the results.
  • Determine if this value is greater than the threshold (if it is, consider the two questions the same and predict 1, otherwise 0)
  • Compare against the actual target and, if the prediction matches, increment the count of correct predictions
  • Divide the count of correct predictions by the number of processed elements
correct = 0

for row in batch:
    similarity = trax_numpy.dot(v1[row], v2[row])
    similar_enough = similarity > threshold
    correct += (y_test[row] == similar_enough)

accuracy = correct / batch_size
print(f"The accuracy of the model is: {accuracy:0.4f}.")
The accuracy of the model is: 0.6621.
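The loop above can also be written without iterating row by row. The sketch below uses plain NumPy rather than trax's numpy, and the vectors and labels are made up for illustration; the function name `batch_accuracy` is hypothetical. Like the loop, it relies on the rows already being unit length so that the dot product equals the cosine similarity.

```python
import numpy


def batch_accuracy(v1, v2, labels, threshold):
    """Fraction of row pairs whose thresholded similarity matches the label"""
    # Row-wise dot products: equal to cosine similarity only for unit-length rows.
    similarities = numpy.sum(v1 * v2, axis=1)
    predictions = similarities > threshold
    return float(numpy.mean(predictions == (labels == 1)))


# Toy unit-length vectors (not the real model output).
toy_v1 = numpy.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
toy_v2 = numpy.array([[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]])
toy_labels = numpy.array([1, 0, 1])

toy_accuracy = batch_accuracy(toy_v1, toy_v2, toy_labels, threshold=0.7)
print(f"The accuracy on the toy batch is: {toy_accuracy:0.4f}")
```

The third pair has a similarity of 0.6, so with a threshold of 0.7 it is predicted as a non-duplicate even though its label is 1. Lowering the threshold would flip that prediction.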

Modified Triplet Loss

Beginning

We'll be looking at how to calculate the full triplet loss as well as a matrix of similarity scores.

Background

This is the original triplet loss function:

\[ \mathcal{L_\mathrm{Original}} = \max{(\mathrm{s}(A,N) -\mathrm{s}(A,P) +\alpha, 0)} \]

It can be improved by including the mean negative and the closest negative, to create a new full loss function. The inputs are the Anchor \(\mathrm{A}\), Positive \(\mathrm{P}\) and Negative \(\mathrm{N}\).

\begin{align} \mathcal{L_\mathrm{1}} &= \max{(mean\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{2}} &= \max{(closest\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{Full}} &= \mathcal{L_\mathrm{1}} + \mathcal{L_\mathrm{2}}\\ \end{align}
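To make the formulas concrete, here is a small worked sketch for a single anchor. The similarity values are invented, and the closest negative is taken to be the largest \(\mathrm{s}(A,N)\) that does not exceed \(\mathrm{s}(A,P)\), which is the definition used later in this post.

```python
def hinge(value):
    """max(value, 0)"""
    return max(value, 0.0)


alpha = 0.25
sim_ap = 0.5                 # s(A, P) for one anchor (made-up value)
sim_an = [0.4, -0.2, 0.1]    # s(A, N) for three negatives (made-up values)

mean_neg = sum(sim_an) / len(sim_an)
closest_neg = max(score for score in sim_an if score <= sim_ap)

loss_1 = hinge(mean_neg - sim_ap + alpha)
loss_2 = hinge(closest_neg - sim_ap + alpha)
loss_full = loss_1 + loss_2

print(f"L1: {loss_1:.2f}  L2: {loss_2:.2f}  Full: {loss_full:.2f}")
```

Here the mean negative is far enough from the positive that \(\mathcal{L_\mathrm{1}}\) is zero, while the closest negative falls inside the margin and contributes to \(\mathcal{L_\mathrm{2}}\).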

Imports

# from pypi
import numpy

Middle

Similarity Scores

The first step is to calculate the matrix of similarity scores using cosine similarity so that you can look up \(\mathrm{s}(A,P)\), \(\mathrm{s}(A,N)\) as needed for the loss formulas.

Two Vectors

First, this is how to calculate the similarity score, using cosine similarity, for 2 vectors.

\[ \mathrm{s}(v_1,v_2) = \mathrm{cosine \ similarity}(v_1,v_2) = \frac{v_1 \cdot v_2}{||v_1||~||v_2||} \]

Similarity score

def cosine_similarity(v1: numpy.ndarray, v2: numpy.ndarray) -> float:
    """Calculates the cosine similarity between two vectors

    Args:
     v1: first vector
     v2: vector to compare to v1

    Returns:
     the cosine similarity between v1 and v2
    """
    numerator = numpy.dot(v1, v2)
    denominator = numpy.sqrt(numpy.dot(v1, v1)) * numpy.sqrt(numpy.dot(v2, v2))
    return numerator / denominator
  • Similar vectors
    v1 = numpy.array([1, 2, 3], dtype=float)
    v2 = numpy.array([1, 2, 3.5])
    
    print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")
    
    cosine similarity : 0.9974
    
  • Identical Vectors
    v2 = v1
    print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")
    
    cosine similarity : 1.0000
    
  • Opposite Vectors
    v2 = -v1
    print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")
    
    cosine similarity : -1.0000
    
  • Dissimilar Vectors
    v2 = numpy.array([0,-42,1])
    print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")
    
    cosine similarity : -0.5153
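Two more cases may help with intuition: cosine similarity ignores vector length, and perpendicular vectors score zero. The sketch below repeats the formula in a helper named `cosine` (a hypothetical name) so that it runs on its own, and the vectors are made up.

```python
import numpy


def cosine(v1, v2):
    """Same formula as cosine_similarity above, repeated so this block is standalone"""
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))


base = numpy.array([1.0, 2.0, 3.0])
scaled = 10 * base                           # same direction, ten times longer
orthogonal = numpy.array([3.0, 0.0, -1.0])   # dot product with base is 0

print(f"scaled     : {cosine(base, scaled):0.4f}")
print(f"orthogonal : {cosine(base, orthogonal):0.4f}")
```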
    

Two Batches of Vectors

Now let's look at how to calculate the similarity scores, using cosine similarity, for 2 batches of vectors. These are rows of individual vectors, just like in the example above, but stacked vertically into a matrix. They would look like the image below for a batch size (row count) of 4 and embedding size (column count) of 5.

The data is set up so that \(v_{1\_1}\) and \(v_{2\_1}\) represent duplicate inputs, but they are not duplicates of any other rows in the batch. This means \(v_{1\_1}\) and \(v_{2\_1}\) (green and green) have more similar vectors than, say, \(v_{1\_1}\) and \(v_{2\_2}\) (green and magenta).

We'll use two different methods for calculating the matrix of similarities from 2 batches of vectors.

The Input data.

v1_1 = numpy.array([1, 2, 3])
v1_2 = numpy.array([9, 8, 7])
v1_3 = numpy.array([-1, -4, -2])
v1_4 = numpy.array([1, -7, 2])
v1 = numpy.vstack([v1_1, v1_2, v1_3, v1_4])
print("v1 :")
print(v1, "\n")
v2_1 = v1_1 + numpy.random.normal(0, 2, 3)  # add some noise to create approximate duplicate
v2_2 = v1_2 + numpy.random.normal(0, 2, 3)
v2_3 = v1_3 + numpy.random.normal(0, 2, 3)
v2_4 = v1_4 + numpy.random.normal(0, 2, 3)
v2 = numpy.vstack([v2_1, v2_2, v2_3, v2_4])
print("v2 :")
print(v2, "\n")
v1 :
[[ 1  2  3]
 [ 9  8  7]
 [-1 -4 -2]
 [ 1 -7  2]] 

v2 :
[[ 1.34263076  1.18510671  1.04373534]
 [ 8.96692933  6.50763316  7.03243982]
 [-3.4497247  -6.08808183 -4.54327564]
 [-0.77144774 -9.08449817  4.4633513 ]] 

For this to work the batch sizes must match.

assert len(v1) == len(v2)

Now let's look at the similarity scores.

  • Option 1 : nested loops and the cosine similarity function
    batch_size, columns = v1.shape
    scores_1 = numpy.zeros([batch_size, batch_size])
    
    rows, columns = scores_1.shape
    
    for row in range(rows):
        for column in range(columns):
            scores_1[row, column] = cosine_similarity(v1[row], v2[column])
    
    print("Option 1 : Loop")
    print(scores_1)
    
    Option 1 : Loop
    [[ 0.88245143  0.87735873 -0.93717609 -0.14613242]
     [ 0.99999485  0.99567656 -0.95998199 -0.34214656]
     [-0.86016573 -0.81584759  0.96484391  0.60584372]
     [-0.31943701 -0.23354642  0.49063636  0.96181686]]
    
  • Option 2 : Vector Normalization and the Dot Product
    def norm(x: numpy.ndarray) -> numpy.ndarray:
        """Normalize x"""
        return x / numpy.sqrt(numpy.sum(x * x, axis=1, keepdims=True))
    
    scores_2 = numpy.dot(norm(v1), norm(v2).T)
    
    print("Option 2 : Vector Norm & dot product")
    print(scores_2)
    
    Option 2 : Vector Norm & dot product
    [[ 0.88245143  0.87735873 -0.93717609 -0.14613242]
     [ 0.99999485  0.99567656 -0.95998199 -0.34214656]
     [-0.86016573 -0.81584759  0.96484391  0.60584372]
     [-0.31943701 -0.23354642  0.49063636  0.96181686]] 
    
    

Check

Let's make sure we get the same answer in both cases.

assert numpy.allclose(scores_1, scores_2)

Hard Negative Mining

Now we'll calculate the mean negative \(mean\_neg\) and the closest negative \(closest\_neg\) used in calculating \(\mathcal{L_\mathrm{1}}\) and \(\mathcal{L_\mathrm{2}}\).

\begin{align} \mathcal{L_\mathrm{1}} &= \max{(mean\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{2}} &= \max{(closest\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \end{align}

We'll do this using the matrix of similarity scores for a batch size of 4. The diagonal of the matrix contains all the \(\mathrm{s}(A,P)\) values, similarities from duplicate question pairs (aka Positives). This is an important attribute for the calculations to follow.

Mean Negative

mean_neg is the average of the off diagonals, the \(\mathrm{s}(A,N)\) values, for each row.

Closest Negative

closest_neg is the largest off-diagonal value, \(\mathrm{s}(A,N)\), that is no larger than the diagonal \(\mathrm{s}(A,P)\) for each row.

We'll start with some hand-made similarity scores.

similarity_scores = numpy.array(
    [
        [0.9, -0.8, 0.3, -0.5],
        [-0.4, 0.5, 0.1, -0.1],
        [0.3, 0.1, -0.4, -0.8],
        [-0.5, -0.2, -0.7, 0.5],
    ]
)

Positives

All the s(A,P) values are similarities from duplicate question pairs (aka Positives). These are along the diagonal.

sim_ap = numpy.diag(similarity_scores)
print("s(A, P) :\n")
print(numpy.diag(sim_ap))
s(A, P) :

[[ 0.9  0.   0.   0. ]
 [ 0.   0.5  0.   0. ]
 [ 0.   0.  -0.4  0. ]
 [ 0.   0.   0.   0.5]]

Negatives

All the s(A,N) values are similarities of the non duplicate question pairs (aka Negatives). These are in the cells not on the diagonal.

sim_an = similarity_scores - numpy.diag(sim_ap)
print("s(A, N) :\n")
print(sim_an)
s(A, N) :

[[ 0.  -0.8  0.3 -0.5]
 [-0.4  0.   0.1 -0.1]
 [ 0.3  0.1  0.  -0.8]
 [-0.5 -0.2 -0.7  0. ]]

Mean negative

This is the average of the s(A,N) values for each row.

batch_size = similarity_scores.shape[0]
mean_neg = numpy.sum(sim_an, axis=1, keepdims=True) / (batch_size - 1)
print("mean_neg :\n")
print(mean_neg)
mean_neg :

[[-0.33333333]
 [-0.13333333]
 [-0.13333333]
 [-0.46666667]]

Closest negative

This is the largest s(A,N) that is <= s(A,P) for each row.

mask_1 = numpy.identity(batch_size) == 1            # mask to exclude the diagonal
mask_2 = sim_an > sim_ap.reshape(batch_size, 1)  # mask to exclude sim_an > sim_ap
mask = mask_1 | mask_2
sim_an_masked = numpy.copy(sim_an)         # create a copy to preserve sim_an
sim_an_masked[mask] = -2

closest_neg = numpy.max(sim_an_masked, axis=1, keepdims=True)
print("Closest Negative :\n")
print(closest_neg)
Closest Negative :

[[ 0.3]
 [ 0.1]
 [-0.8]
 [-0.2]]

The Loss Functions

The last step is to calculate the loss functions.

\begin{align} \mathcal{L_\mathrm{1}} &= \max{(mean\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{2}} &= \max{(closest\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{Full}} &= \mathcal{L_\mathrm{1}} + \mathcal{L_\mathrm{2}}\\ \end{align}

The Alpha margin.

alpha = 0.25

Modified triplet loss

loss_1 = numpy.maximum(mean_neg - sim_ap.reshape(batch_size, 1) + alpha, 0)
loss_2 = numpy.maximum(closest_neg - sim_ap.reshape(batch_size, 1) + alpha, 0)
loss_full = loss_1 + loss_2

Cost

cost = numpy.sum(loss_full)
print("Loss Full :\n")
print(loss_full)
print(f"\ncost : {cost:.3f}")
Loss Full :

[[0.        ]
 [0.        ]
 [0.51666667]
 [0.        ]]

cost : 0.517
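The steps above can be collected into one function. This is only a sketch in plain NumPy under the same conventions used in this post (positives on the diagonal, -2 as the mask value for excluded negatives); the function name `modified_triplet_cost` is hypothetical, and it is not the Trax version used for training.

```python
import numpy


def modified_triplet_cost(scores, alpha=0.25):
    """Sum of L1 + L2 over a square matrix of similarity scores"""
    batch_size = scores.shape[0]
    positives = numpy.diag(scores).reshape(batch_size, 1)
    on_diagonal = numpy.identity(batch_size, dtype=bool)

    # Mean negative: average of the off-diagonal values in each row.
    negatives = numpy.where(on_diagonal, 0.0, scores)
    mean_neg = negatives.sum(axis=1, keepdims=True) / (batch_size - 1)

    # Closest negative: largest off-diagonal value that is <= the positive.
    excluded = on_diagonal | (scores > positives)
    closest_neg = numpy.where(excluded, -2.0, scores).max(axis=1, keepdims=True)

    loss_1 = numpy.maximum(mean_neg - positives + alpha, 0)
    loss_2 = numpy.maximum(closest_neg - positives + alpha, 0)
    return float(numpy.sum(loss_1 + loss_2))


# The same hand-made scores used in the walk-through.
scores = numpy.array([
    [0.9, -0.8, 0.3, -0.5],
    [-0.4, 0.5, 0.1, -0.1],
    [0.3, 0.1, -0.4, -0.8],
    [-0.5, -0.2, -0.7, 0.5],
])
total_cost = modified_triplet_cost(scores)
print(f"cost : {total_cost:.3f}")
```

Only the third row contributes to the cost: its positive score (-0.4) is lower than its mean negative, so the margin is violated.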

Siamese Networks With Trax

Beginning

Imports

# pypi
from jax.interpreters.xla import _DeviceArray as DeviceArray
from trax import layers

import numpy
import trax
import trax.fastmath.numpy as fast_numpy

Middle

L2 Normalization

Before building the model you will need to define a function that applies L2 normalization to a tensor. Luckily this is pretty straightforward.

def normalize(x: numpy.ndarray) -> DeviceArray:
    """L2 Normalization

    Args:
     x: the data to normalize

    Returns:
     normalized version of x
    """
    return x / fast_numpy.sqrt(fast_numpy.sum(x * x, axis=-1, keepdims=True))

The denominator can be replaced by numpy.linalg.norm(x, axis=-1, keepdims=True) to achieve the same result.

tensor = numpy.random.random((2,5))
print(f'The tensor is of type: {type(tensor)}\n\nAnd looks like this:\n\n {tensor}')
The tensor is of type: <class 'numpy.ndarray'>

And looks like this:

 [[0.68535982 0.95339335 0.00394827 0.51219226 0.81262096]
 [0.61252607 0.72175532 0.29187607 0.91777412 0.71457578]]
norm_tensor = normalize(tensor)
print(f'The normalized tensor is of type: {type(norm_tensor)}\n\nAnd looks like this:\n\n {norm_tensor}')
The normalized tensor is of type: <class 'jax.interpreters.xla._DeviceArray'>

And looks like this:

 [[0.45177674 0.6284596  0.00260263 0.33762783 0.535665  ]
 [0.40091467 0.47240815 0.1910407  0.6007077  0.46770892]]

Notice that the initial tensor was converted from a numpy array to a jax array in the process.

The Siamese Model

To create a Siamese model you will first need to create an LSTM model using the Serial combinator layer and then use another combinator layer called Parallel to create the Siamese model. You should be familiar with the following layers:

  • Serial: A combinator layer that lets you stack layers serially using function composition.
  • Embedding: Maps discrete tokens to vectors. Its weights have shape (vocabulary length X dimension of output vectors). The dimension of output vectors (also called d_feature) is the number of elements in the word embedding.
  • LSTM: The LSTM layer. It leverages another Trax layer called LSTMCell. The number of units should be specified and should match the number of elements in the word embedding.
  • Mean: Computes the mean across a desired axis. Mean uses one tensor axis to form groups of values and replaces each group with the mean value of that group.
  • Fn: Layer with no weights that applies the function f, which should be specified using a lambda syntax.
  • Parallel: A combinator layer (like Serial) that applies a list of layers in parallel to its inputs.
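
To make the Mean and normalization steps concrete, here is a small numpy sketch of what they do to the shapes. It doesn't use trax at all: the random array is only a stand-in for the LSTM's output, and the sizes are made up.

```python
import numpy

rng = numpy.random.default_rng(0)
batch, length, features = 2, 7, 4

# stand-in for the LSTM output: one vector per token in each question
lstm_output = rng.normal(size=(batch, length, features))

# averaging over axis 1 (the tokens) leaves one vector per question
pooled = lstm_output.mean(axis=1)

# L2-normalize each question vector so it has length 1
unit = pooled / numpy.linalg.norm(pooled, axis=-1, keepdims=True)

print(lstm_output.shape, "->", pooled.shape)
# with unit vectors the dot product is the cosine similarity
print(unit @ unit.T)
```

Since every row has length 1, the diagonal of the last matrix is 1 and the off-diagonal entries are the cosine similarities between the two stand-in questions.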

Putting everything together the Siamese model looks like this:

vocab_size = 500
model_dimension = 128

# Define the LSTM model
LSTM = layers.Serial(
        layers.Embedding(vocab_size=vocab_size, d_feature=model_dimension),
        layers.LSTM(model_dimension),
        layers.Mean(axis=1),
        layers.Fn('Normalize', lambda x: normalize(x))
    )

# Use the Parallel combinator to create a Siamese model out of the LSTM 
Siamese = layers.Parallel(LSTM, LSTM)

Next is a helper function that prints information for every layer (sublayer within Serial):

def show_layers(model, layer_prefix):
    print(f"Total layers: {len(model.sublayers)}\n")
    for i in range(len(model.sublayers)):
        print('========')
        print(f'{layer_prefix}_{i}: {model.sublayers[i]}\n')
print('Siamese model:\n')
show_layers(Siamese, 'Parallel.sublayers')
Siamese model:

Total layers: 2

========
Parallel.sublayers_0: Serial[
  Embedding_500_128
  LSTM_128
  Mean
  Normalize
]

========
Parallel.sublayers_1: Serial[
  Embedding_500_128
  LSTM_128
  Mean
  Normalize
]
print('Detail of LSTM models:\n')
show_layers(LSTM, 'Serial.sublayers')
Detail of LSTM models:

Total layers: 4

========
Serial.sublayers_0: Embedding_500_128

========
Serial.sublayers_1: LSTM_128

========
Serial.sublayers_2: Mean

========
Serial.sublayers_3: Normalize

End

NER: Testing the Model

Testing New Sentences

# python
from pathlib import Path

# pypi
from trax import layers

import numpy

# this project
from neurotic.nlp.named_entity_recognition import (NER,
                                                   NERData,
                                                   TOKEN)

Set Up the Model and Maps

data = NERData().data
model = NER(vocabulary_size=len(data.vocabulary),
            tag_count=len(data.tags)).model
model.init_from_file(Path("~/models/ner/model.pkl.gz").expanduser(),
                     weights_only=True)
print(model)
Serial[
  Embedding_35180_50
  LSTM_50
  Dense_18
  LogSoftmax
]

Middle

def predict(sentence: str,
            model: layers.Serial=model,
            vocabulary: dict=data.vocabulary,
            tags: dict=data.tags,
            unknown: int=data.vocabulary[TOKEN.unknown]) -> list:
    """Predicts the named entities in a sentence

    Args:
     sentence: the sentence to analyze
     model: the NER model
     vocabulary: token to id map
     tags: tag to id map
     unknown: id to use for tokens that aren't in the vocabulary

    Returns:
     the predicted tag for each whitespace-separated token
    """
    tokens = [vocabulary.get(token, unknown)
              for token in sentence.split()]
    batch_data = numpy.ones((1, len(tokens)))
    batch_data[0][:] = tokens
    sentence = numpy.array(batch_data).astype(int)
    output = model(sentence)
    outputs = numpy.argmax(output, axis=-1)
    labels = list(tags.keys())

    indices = (outputs[0][index] for index in range(len(outputs[0])))
    predictions = [labels[index] for index in indices]
    return predictions
sentence = "Bilbo Baggins, the Shire's director of trade and manufacturing policy for the Lord Sauron, said in an interview on Sunday morning that Rumblefish was working to prepare for the possibility of a second wave of the Coronavirus in the Fall, although he said it wouldn’t necessarily come before the fall of the Empire and the rise of the corpse brigade in July"

def print_predictions(sentence: str):
    predictions = predict(sentence)
    for word, entity in zip(sentence.split(), predictions):
        if entity != 'O':
            print(f"{word} - {entity}")
    return

print_predictions(sentence)
Lord - B-org
Sauron, - I-org
Sunday - B-tim
morning - I-tim
July - B-tim
print_predictions("anyone lived in a pretty how town "
                  "(with up so floating many bells down) "
                  "spring summer autumn winter "
                  "he sang his didn't he danced his did.")
summer - I-tim
autumn - I-tim

Hmm, that's interesting.

print_predictions("Spring Summer Autumn Winter")
Summer - B-eve

Some kind of anti-spring bias.

print_predictions("Boogie booty bunny butt")
booty - B-per

Well, I suppose I'd have to match the dataset to put more weird things in there.

NER: Evaluating the Model

Beginning

Now we'll evaluate our model using the test set. To do this we'll need to create a mask to avoid counting the padding tokens when computing the accuracy.

  • Step 1: Calling model(sentences) will give us the predicted output.
  • Step 2: The output will be the prediction with an added dimension. For each word in each sentence there will be a vector of probabilities for each tag type. For each word in each sentence we'll need to pick the maximum valued tag. This will require numpy.argmax and careful use of the axis argument.
  • Step 3: Create a mask to prevent counting pad characters. It will have the same dimensions as the output.
  • Step 4: Compute the accuracy metric by comparing the outputs against the test labels. Take the sum of that and divide by the total number of unpadded tokens. Use the mask value to mask the padded tokens.
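
The four steps can be tried out on a tiny hand-made batch first. This is a sketch with made-up values: the pad id of 9, the labels, and the scores are all hypothetical and only there so the arithmetic is easy to follow.

```python
import numpy

PAD = 9  # hypothetical pad id for this toy example

# two sentences, padded to three tokens each
labels = numpy.array([[1, 2, PAD],
                      [0, PAD, PAD]])

# Step 1 stand-in: made-up model output, one score per tag for each token
scores = numpy.array([
    [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05]],
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]],
])

# Step 2: pick the highest-scoring tag along the last axis
predicted = numpy.argmax(scores, axis=-1)

# Step 3: True wherever the label is a real token, False for padding
mask = labels != PAD

# Step 4: count correct predictions on real tokens only
correct = (predicted == labels) & mask
accuracy = correct.sum() / mask.sum()
print(accuracy)
```

There are three real tokens here and two of them are predicted correctly, so the accuracy is 2/3 no matter what the model says about the padded positions.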

Imports

# python
from collections import namedtuple
from functools import partial
from pathlib import Path

import random

# pypi
import holoviews
import hvplot.pandas
import jax
import numpy
import pandas
import trax

# this project
from neurotic.nlp.named_entity_recognition import (DataGenerator,
                                                   NER,
                                                   NERData,
                                                   TOKEN)
# another project
from graeae import EmbedHoloviews

Set Up

Plotting

slug = "ner-evaluating-the-model"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{slug}")

Plot = namedtuple("Plot", ["width", "height", "fontscale", "tan", "blue", "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

The Previous Code

data = NERData().data
model = NER(vocabulary_size=len(data.vocabulary),
            tag_count=len(data.tags)).model

Settings = namedtuple("Settings", ["batch_size", "padding_id", "seed"])
SETTINGS = Settings(batch_size=64,
                    padding_id=data.vocabulary[TOKEN.pad],
                    seed=33)

model.init_from_file(Path("~/models/ner/model.pkl.gz").expanduser())
print(model)

random.seed(SETTINGS.seed)

test_generator = DataGenerator(x=data.data_sets.x_test,
                               y=data.data_sets.y_test,
                               batch_size=SETTINGS.batch_size,
                               padding=SETTINGS.padding_id)
Serial[
  Embedding_35180_50
  LSTM_50
  Dense_18
  LogSoftmax
]

Middle

As a reminder, here's what happens when you apply a boolean comparison to a numpy array.

a = numpy.array([1, 2, 3, 4])
print(a == 2)
[False  True False False]

A Test Input

x, y = next(test_generator)
print(f"x's shape: {x.shape} y's shape: {y.shape}")

predictions = model(x)
print(type(predictions))
print(f"predictions has shape: {predictions.shape}")
x's shape: (64, 44) y's shape: (64, 44)
<class 'jax.interpreters.xla._DeviceArray'>
predictions has shape: (64, 44, 18)

Note: the model's prediction has 3 axes:

  • the number of examples
  • the number of words in each example (padded to be as long as the longest sentence in the batch)
  • the number of possible targets (18: the 17 named entity tags plus UNK).
def evaluate_prediction(pred: jax.interpreters.xla._DeviceArray,
                        labels: numpy.ndarray,
                        pad: int=SETTINGS.padding_id) -> float:
    """Calculates the accuracy of a prediction

    Args:
      pred: prediction array with shape 
           (num examples, max sentence length in batch, num of classes)
      labels: array of size (batch_size, seq_len)
      pad: integer representing pad character

    Returns:
      accuracy: fraction of correct predictions
    """
    outputs = numpy.argmax(pred, axis=-1)
    mask = labels != pad
    return numpy.sum((outputs==labels)[mask])/numpy.sum(mask)
accuracy = evaluate_prediction(model(x), y)
print("accuracy: ", accuracy)
accuracy:  0.9636752

Hmm, does pretty good.

Plotting

Let's look at running more batches. It occurred to me that you could also just do the whole set at once; I don't know what's special about using the batches.

repetitions = range(
    int(len(data.data_sets.x_test)/SETTINGS.batch_size))
nexts = (next(test_generator) for repetition in repetitions)
accuracies = [evaluate_prediction(model(x), y) for x, y in nexts]
data = pandas.DataFrame.from_dict(dict(Accuracy=accuracies))
plot = data.Accuracy.hvplot(kind="hist", color=PLOT.tan).opts(
    title="Accuracy Distribution",
    height=PLOT.height,
    width=PLOT.width,
    fontscale=PLOT.fontscale)

output = Embed(plot=plot, file_name="accuracy_distribution")()
print(output)

Figure Missing

NER: Training the Model

Training the Model

Imports

# from python
from collections import namedtuple
from functools import partial
from tempfile import TemporaryFile

import random
import sys

# from pypi
from holoviews import opts
from trax import layers
from trax.supervised import training

import holoviews
import hvplot.pandas
import pandas
import trax

# this project
from neurotic.nlp.named_entity_recognition import (DataGenerator,
                                                   NER,
                                                   NERData,
                                                   TOKEN)
# another project
from graeae import EmbedHoloviews, Timer

Set Up

Plotting

slug = "ner-training-the-model"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{slug}")

Plot = namedtuple("Plot", ["width", "height", "fontscale", "tan", "blue", "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

Data

ner = NERData()

Settings = namedtuple("Settings", ["seed", "batch_size", "embedding_size", "learning_rate"])
SETTINGS = Settings(seed=33, batch_size=64, embedding_size=50, learning_rate=0.01)
trainee = NER(vocabulary_size=len(ner.data.vocabulary),
              tag_count=len(ner.data.tags))
random.seed(SETTINGS.seed)

training_generator = DataGenerator(x=ner.data.data_sets.x_train,
                                   y=ner.data.data_sets.y_train,
                                   batch_size=SETTINGS.batch_size,
                                   padding=ner.data.vocabulary[TOKEN.pad])

validation_generator = DataGenerator(x=ner.data.data_sets.x_validate,
                                     y=ner.data.data_sets.y_validate,
                                     batch_size=SETTINGS.batch_size,
                                     padding=ner.data.vocabulary[TOKEN.pad])

TIMER = Timer(speak=False)

Middle

The Data Generators

Before we start, we need to create the data generators for training and validation data. It is important that you mask padding in the loss weights of your data, which can be done using the id_to_mask argument of trax.data.inputs.add_loss_weights.

train_generator = trax.data.inputs.add_loss_weights(
    training_generator,
    id_to_mask=ner.data.vocabulary[TOKEN.pad])

evaluate_generator = trax.data.inputs.add_loss_weights(
    validation_generator,
    id_to_mask=ner.data.vocabulary[TOKEN.pad])
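
Conceptually, masking the padding means giving each padded position a loss weight of zero. The generator below is my own numpy stand-in to show the idea. It is not trax's implementation, and the function name is made up.

```python
import numpy


def with_loss_weights(batches, id_to_mask: int):
    """Hypothetical stand-in: adds a weight of 0 wherever the target is padding."""
    for inputs, targets in batches:
        weights = (targets != id_to_mask).astype(numpy.float32)
        yield inputs, targets, weights


PAD = 35178  # the pad id the vocabulary in these posts uses
x = numpy.array([[9, 61, PAD]])
y = numpy.array([[0, 0, PAD]])

inputs, targets, weights = next(with_loss_weights(iter([(x, y)]), PAD))
print(weights)  # [[1. 1. 0.]]
```

With weights like these the padded position adds nothing to a weighted loss, which is why the targets can be padded with the same id as the inputs.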

Training The Model

You will now write a function that takes in your model and trains it.

As you've seen in the previous assignments, you will first create the TrainTask and EvalTask using your data generator. Then you will use the training.Loop to train your model.

Instructions: Implement the train_model function below to train the neural network above. You'll be using a cross entropy loss, with an Adam optimizer. Please read the trax documentation to get a full understanding. The trax GitHub also contains some useful information and a link to a colab notebook.

def train_model(NER: trax.layers.Serial,
                train_generator: type,
                eval_generator: type,
                train_steps: int=1,
                steps_per_checkpoint: int=100,
                learning_rate: float=SETTINGS.learning_rate,
                verbose: bool=False,
                output_dir="~/models/ner/") -> training.Loop:
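    """Train the Named Entity Recognition Model

    Args: 
      NER: the model you are building
      train_generator: The data generator for training examples
      eval_generator: The data generator for validation examples
      train_steps: number of training steps
      steps_per_checkpoint: number of steps between checkpoints
      learning_rate: learning rate for the Adam optimizer
      verbose: whether to print the number of steps before running
      output_dir: folder to save your model

    Returns:
      training_loop: a trax supervised training Loop
    """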
    train_task = training.TrainTask(
        labeled_data=train_generator,
        loss_layer = layers.WeightedCategoryCrossEntropy(),
        optimizer = trax.optimizers.Adam(learning_rate),
        n_steps_per_checkpoint=steps_per_checkpoint,
    )

    eval_task = training.EvalTask(
      labeled_data = eval_generator,
      metrics = [layers.WeightedCategoryCrossEntropy(),
                 layers.Accuracy()],
      n_eval_batches = SETTINGS.batch_size
    )

    training_loop = training.Loop(
        NER,
        train_task,
        eval_tasks=[eval_task],
        output_dir=output_dir)

    if verbose:
        print(f"Running {train_steps} steps")
    training_loop.run(n_steps = train_steps)
    return training_loop

For some reason they don't give you the option to turn off the print statements so I'm going to suppress all stdout.

training_steps = 1500
real_stdout = sys.stdout

TIMER.emit = False
TIMER.start()
with TemporaryFile("w") as temp_file:
    sys.stdout = temp_file
    training_loop = train_model(trainee.model, train_generator,
                                evaluate_generator,
                                steps_per_checkpoint=10,
                                train_steps=training_steps,
                                verbose=False)
TIMER.stop()
sys.stdout = real_stdout
print(f"{TIMER.ended - TIMER.started}")
0:03:51.538599
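
Another way to silence the output is contextlib.redirect_stdout from the standard library, which puts sys.stdout back even if the wrapped code raises an exception. This is just a sketch: noisy_training is a made-up stand-in for the real training call, and the redirect only catches output written through Python's sys.stdout.

```python
import io
from contextlib import redirect_stdout


def noisy_training() -> str:
    """Hypothetical stand-in for a function that prints its progress."""
    print("Step 1: loss 0.5")
    return "done"


sink = io.StringIO()
with redirect_stdout(sink):
    # anything printed in here goes into the sink instead of the terminal
    result = noisy_training()

# stdout is back to normal once the with-block exits
print(result)
```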

Plotting the Metrics

Accuracy

history = training_loop.history
frame = pandas.DataFrame(history.get("eval", "metrics/Accuracy"),
                         columns="Batch Accuracy".split())
maximum = frame.loc[frame.Accuracy.idxmax()]
vline = holoviews.VLine(maximum.Batch).opts(opts.VLine(color=PLOT.red))
hline = holoviews.HLine(maximum.Accuracy).opts(opts.HLine(color=PLOT.red))
line = frame.hvplot(x="Batch",
                    y="Accuracy").opts(
                        opts.Curve(color=PLOT.blue))

plot = (line * hline * vline).opts(
    width=PLOT.width,
    height=PLOT.height, title="Evaluation Batch Accuracy",
                                   )
output = Embed(plot=plot, file_name="evaluation_accuracy")()
print(output)

Figure Missing

Plotting Loss

frame = pandas.DataFrame(history.get("eval",
                                     "metrics/WeightedCategoryCrossEntropy"),
                         columns="Batch Loss".split())
minimum = frame.loc[frame.Loss.idxmin()]
vline = holoviews.VLine(minimum.Batch).opts(opts.VLine(color=PLOT.red))
hline = holoviews.HLine(minimum.Loss).opts(opts.HLine(color=PLOT.red))
line = frame.hvplot(x="Batch", y="Loss").opts(opts.Curve(color=PLOT.blue))

plot = (line * hline * vline).opts(
    width=PLOT.width, height=PLOT.height,
    title="Evaluation Batch Cross Entropy",
                                   )
output = Embed(plot=plot, file_name="evaluation_cross_entropy")()
print(output)

Figure Missing

So it looks like I passed the best point again and am probably overfitting. I wonder if they have a callback to grab the best model like pytorch does? I'm surprised at how fast these models train.

NER: Building the Model

Beginning

Here we'll actually build the model.

  • Feed the data into an Embedding layer, to produce more semantic entries
  • Feed it into an LSTM layer
  • Run the output through a linear layer
  • Run the result through a log softmax layer to get the predicted class for each word.

Imports

# pypi
from trax import layers

# this project
from neurotic.nlp.named_entity_recognition import DataGenerator, NERData, TOKEN

Set Up

ner = NERData()

vocab = vocabulary = ner.data.vocabulary
tag_map = tags = ner.data.tags

Middle

These are the Trax components we'll use (the links are to the implementations on Github). The tl prefix is the conventional alias for trax.layers, which the code below imports as layers.

  • tl.Serial: Combinator that applies layers serially (by function composition).
  • tl.Embedding: Initializes the embedding. In this case its weights have the shape size of the vocabulary by the dimension of the model.
    • tl.Embedding(vocab_size, d_feature).
    • vocab_size is the number of unique words in the given vocabulary.
    • d_feature is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
  • tl.LSTM: Trax LSTM layer of size d_model.
    • tl.LSTM(n_units): Builds an LSTM layer with n_units units.
  • tl.Dense: A dense layer.
    • tl.Dense(n_units): The parameter n_units is the number of units chosen for this dense layer.
  • tl.LogSoftmax: Log of the output probabilities.
    • Here, you don't need to set any parameters for LogSoftmax().
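
Since the last layer outputs log probabilities rather than probabilities, it helps to see that taking the argmax still picks the same tag. This is a numpy sketch of a log softmax with made-up scores. It is not the trax layer itself.

```python
import numpy


def log_softmax(logits: numpy.ndarray) -> numpy.ndarray:
    """Log of the softmax along the last axis."""
    # subtracting the max first keeps the exponentials from overflowing
    shifted = logits - numpy.max(logits, axis=-1, keepdims=True)
    return shifted - numpy.log(numpy.sum(numpy.exp(shifted), axis=-1, keepdims=True))


# made-up scores for one token over three tags
logits = numpy.array([[2.0, 0.5, -1.0]])
log_probabilities = log_softmax(logits)

# exponentiating gives probabilities that sum to 1
print(numpy.exp(log_probabilities).sum(axis=-1))
# the log doesn't change which tag scores highest
print(numpy.argmax(logits, axis=-1), numpy.argmax(log_probabilities, axis=-1))
```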

Online documentation

def NER(vocab_size: int=35181, d_model: int=50, tags: dict=tag_map):
    """Builds the named entity recognition model

    Args: 
      vocab_size: number of words in the vocabulary
      d_model: the embedding size
      tags: map of tag to id (sets the size of the output layer)

    Returns:
       model: a trax serial model
    """
    model = layers.Serial(
        layers.Embedding(vocab_size, d_feature=d_model),
        layers.LSTM(d_model),
        layers.Dense(n_units=len(tags)),
        layers.LogSoftmax()
      )
    return model

Inspecting the Model

model = NER()
# display your model
print(model)
Serial[
  Embedding_35181_50
  LSTM_50
  Dense_18
  LogSoftmax
]

Pack It Up for Later

Imports

# python
from collections import namedtuple

# pypi
from trax import layers

import attr

Constants

Settings = namedtuple("Settings", ["embeddings_size"])
SETTINGS = Settings(50)

The Model

@attr.s(auto_attribs=True)
class NER:
    """The named entity recognition model

    Args:
     vocabulary_size: number of tokens in the vocabulary
     tag_count: number of tags
     embeddings_size: the number of features in the embeddings layer
    """
    vocabulary_size: int
    tag_count: int
    embeddings_size: int=SETTINGS.embeddings_size
    _model: layers.Serial=None

The Actual Model

    @property
    def model(self) -> layers.Serial:
        """The NER model instance"""
        if self._model is None:
            self._model = layers.Serial(
                layers.Embedding(self.vocabulary_size,
                                 d_feature=self.embeddings_size),
                layers.LSTM(self.embeddings_size),
                layers.Dense(n_units=self.tag_count),
                layers.LogSoftmax()
          )
        return self._model
    

Sanity Check

from neurotic.nlp.named_entity_recognition import NER

builder = NER(122, 666)

print(builder.model)
Serial[
  Embedding_122_50
  LSTM_50
  Dense_666
  LogSoftmax
]

NER: Data

The Data

Imports

# from python
import random

# from pypi
import numpy

# this project
from neurotic.nlp.named_entity_recognition import NERData, TOKEN

Set Up

ner = NERData()

# to make the functions pass we need to use their names (initially)
vocab = vocabulary = ner.data.vocabulary
tag_map = tags = ner.data.tags

Middle

Reviewing The Dataset

As a review we can look at what's in the vocabulary.

print(vocabulary["the"])
print(vocabulary[TOKEN.pad])
print(vocabulary["The"])
9
35178
61

The vocabulary maps words in our vocabulary to unique integers. As you can see, we made it case-sensitive.
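
Case-sensitivity matters once you start encoding new sentences, because any spelling that isn't in the vocabulary has to fall back to the unknown token. Here's a small sketch with a toy vocabulary. The words, the ids, the "UNK" key, and the encode function are all made up for illustration and aren't the real vocabulary.

```python
# toy vocabulary (hypothetical words and ids)
toy_vocabulary = {"the": 0, "The": 1, "Shire": 2, "UNK": 3}


def encode(sentence: str, vocabulary: dict, unknown_token: str = "UNK") -> list:
    """Map whitespace-separated tokens to ids, using the unknown id as a fallback."""
    unknown = vocabulary[unknown_token]
    return [vocabulary.get(token, unknown) for token in sentence.split()]


# "shire" (lowercase) isn't in the toy vocabulary, so it gets the unknown id
print(encode("The shire the Shire", toy_vocabulary))  # [1, 3, 0, 2]
```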

We also made a map for tags.

for tag, index in tags.items():
    print(f" - {tag}: {index}")
- O: 0
- B-geo: 1
- B-gpe: 2
- B-per: 3
- I-geo: 4
- B-org: 5
- I-org: 6
- B-tim: 7
- B-art: 8
- I-art: 9
- I-per: 10
- I-gpe: 11
- I-tim: 12
- B-nat: 13
- B-eve: 14
- I-eve: 15
- I-nat: 16
- UNK: 17

Prefix  Interpretation
B       Token Begins an entity
I       Token is Inside an entity

This is to help when you have multi-token entities. So if you had the name "Burt Reynolds", "Burt" would be tagged B-per and "Reynolds" would be tagged I-per.
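
To see how the prefixes let you put multi-token entities back together, here's a sketch that groups tagged tokens into entities. The function is my own illustration and not part of the project code. It assumes tags look like "B-per", "I-per", or "O", and it silently drops an I- tag that doesn't continue an open entity of the same type.

```python
def group_entities(tokens: list, tags: list) -> list:
    """Collect (entity type, text) pairs from B-/I- tagged tokens."""
    entities = []
    current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B":
            # a B- tag always starts a new entity, closing any open one
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I" and current_tokens and entity_type == current_type:
            # an I- tag extends the open entity of the same type
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes whatever was open
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = "Burt Reynolds visited Paris".split()
tags = ["B-per", "I-per", "O", "B-geo"]
print(group_entities(tokens, tags))
# [('per', 'Burt Reynolds'), ('geo', 'Paris')]
```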

print(f"The number of tags is {len(tag_map)}")
print(f"The vocabulary size is {len(vocab):,}")
print(f"The training size is {len(ner.data.data_sets.x_train):,}")
print(f"The validation size is {len(ner.data.data_sets.x_validate):,}")
print("The first training sentence is ")
print(f"'{' '.join(ner.data.raw_data_sets.x_train[0])}'")
print("Its corresponding label is")
print(f" '{' '.join(ner.data.raw_data_sets.y_train[0])}'")

print("The first training encoded sentence is ")
print(f"{ner.data.data_sets.x_train[0]}")
print("Its corresponding encoded label is")
print(f"{ner.data.data_sets.y_train[0]}")
The number of tags is 18
The vocabulary size is 35,180
The training size is 33,570
The validation size is 7,194
The first training sentence is 
'Opposition leader Michael Howard said he hopes the government in coming weeks will try to uncover possible security flaws exploited in the attacks .'
Its corresponding label is
 'O O B-per I-per O O O O O O O O O O O O O O O O O O O O'
The first training encoded sentence is 
[7848, 538, 5951, 6187, 172, 502, 2453, 9, 293, 11, 5306, 822, 141, 1962, 7, 26689, 1176, 686, 11905, 14806, 11, 9, 292, 21]
Its corresponding encoded label is
[0, 0, 3, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

A Data Generator

The generator will have a main outer loop:

while True:  
    yield((X,Y))  

This loop runs continuously, in the fashion of generators, pausing when it yields the next values. Each pass of the loop generates one batch of batch_size examples.

It has two inner loops.

  1. The first stores in temporary lists the data samples to be included in the next batch, and finds the maximum length of the sentences contained in it. By adjusting the length to include only the size of the longest sentence in each batch, overall computation is reduced.
  2. The second loop moves those inputs from the temporary lists into NumPy arrays pre-filled with pad values.

There are three slightly out of the ordinary features.

  1. The first is the use of the NumPy full function to fill the NumPy arrays with a pad value. See full function documentation.
  2. The second is tracking the current location in the incoming lists of sentences. Generator variables hold their values between invocations, so we create an index variable, initialize it to zero, and increment it by one for each sample included in a batch. However, we do not use the index to access the positions of the list of sentences directly. Instead, we use it to select one index from a list of indexes. In this way, we can change the order in which we traverse our original list while leaving the original list untouched.
  3. The third relates to wrapping. Because batch_size and the length of the input lists are not aligned, gathering a batch_size group of inputs may involve wrapping back to the beginning of the input lists. In our approach, it is enough to reset the index to 0. We can also re-shuffle the list of indexes to produce different batches each time.
def data_generator(batch_size: int, x: list, y: list, pad: int,
                   shuffle: bool=False, verbose: bool=False):
    """Generate batches of data for training

    Args: 
      batch_size - size of each batch generated
      x - sentences where words are represented as integers
      y - tags associated with the sentences
      pad - number to use as the padding character
      shuffle - Whether to shuffle the data
      verbose - Whether to print information to stdout

    Yields:
     a tuple containing 2 elements:
       X - np.ndarray of dim (batch_size, max_len) of padded sentences
       Y - np.ndarray of dim (batch_size, max_len) of tags associated with the sentences in X
    """    
    # count the number of sentences in x
    num_lines = len(x)

    # create a list of indexes into x that can be shuffled
    lines_index = list(range(num_lines))

    # shuffle the indexes if shuffle is set to True
    if shuffle:
        random.shuffle(lines_index)

    index = 0 # tracks current location in x, y
    while True:
        buffer_x = [0] * batch_size
        buffer_y = [0] * batch_size
        max_len = 0
        for i in range(batch_size):
             # if the index is greater than or equal to the number of lines in x
            if index >= num_lines:
                # then reset the index to 0
                index = 0
                # re-shuffle the indexes if shuffle is set to True
                if shuffle:
                    random.shuffle(lines_index)

            # The current position is obtained using `lines_index[index]`
            # Store the x value at the current position into the buffer_x
            buffer_x[i] = x[lines_index[index]]

            # Store the y value at the current position into the buffer_y
            buffer_y[i] = y[lines_index[index]]

            lenx = len(buffer_x[i])    #length of current x[]
            if lenx > max_len:
                max_len = lenx                   #max_len tracks longest x[]

            # increment index by one
            index += 1


        # create X,Y, NumPy arrays of size (batch_size, max_len) 'full' of pad value
        X = numpy.full((batch_size, max_len), pad)
        Y = numpy.full((batch_size, max_len), pad)

        # copy values from lists to NumPy arrays. Use the buffered values
        for i in range(batch_size):
            # get the example (sentence as a tensor)
            # in `buffer_x` at the `i` index
            x_i = buffer_x[i]

            # similarly, get the example's labels
            # in `buffer_y` at the `i` index
            y_i = buffer_y[i]

            # Walk through each word in x_i
            for j in range(len(x_i)):
                # store the word in x_i at position j into X
                X[i, j] = x_i[j]

                # store the label in y_i at position j into Y
                Y[i, j] = y_i[j]

        if verbose: print("index=", index)
        yield((X,Y))
batch_size = 5
mini_sentences = ner.data.data_sets.x_train[0: 8]
mini_labels = ner.data.data_sets.y_train[0: 8]
dg = data_generator(batch_size, mini_sentences, mini_labels, vocab["<PAD>"], shuffle=False, verbose=True)
X1, Y1 = next(dg)
X2, Y2 = next(dg)
print(Y1.shape, X1.shape, Y2.shape, X2.shape)
print(X1[0][:], "\n", Y1[0][:])
index= 5
index= 2
(5, 27) (5, 27) (5, 24) (5, 24)
[ 7848   538  5951  6187   172   502  2453     9   293    11  5306   822
   141  1962     7 26689  1176   686 11905 14806    11     9   292    21
 35178 35178 35178] 
 [    0     0     3    10     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
 35178 35178 35178]
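
The index= 5 and index= 2 lines in the output come from the wraparound: with eight sentences and a batch size of five, the second batch runs off the end and starts over. Here's a stripped-down sketch of just the index bookkeeping, with no shuffling and no padding. The function name is made up.

```python
def batch_positions(line_count: int, batch_size: int):
    """Yield the positions used for each batch and the index afterwards."""
    index = 0
    while True:
        positions = []
        for _ in range(batch_size):
            if index >= line_count:
                # ran off the end of the data, so wrap back to the start
                index = 0
            positions.append(index)
            index += 1
        yield positions, index


positions = batch_positions(line_count=8, batch_size=5)
first, index_after_first = next(positions)
second, index_after_second = next(positions)
print(first, index_after_first)    # [0, 1, 2, 3, 4] 5
print(second, index_after_second)  # [5, 6, 7, 0, 1] 2
```

So the second batch reuses the first two sentences, which is fine for training but worth remembering if you count examples per epoch.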

Bundle It Up

Imports

# from python
from typing import List, Tuple
import random

# from pypi
import attr
import numpy

Some Types

Vectors = List[List[int]]
Batch = Tuple[numpy.ndarray, numpy.ndarray]

The Data Generator

@attr.s(auto_attribs=True)
class DataGenerator:
    """A generator of data to train the NER Model

    Args:
     batch_size: how many lines to generate at once
     x: the encoded sentences
     y: the encoded labels 
     padding: encoding to use for padding lines
     shuffle: whether to shuffle the data
     verbose: whether to print messages to stdout
    """
    batch_size: int
    x: Vectors
    y: Vectors
    padding: int
    shuffle: bool=False
    verbose: bool=False
    _batch: iter=None

The Batch Generator

def batch_generator(self):
    """Generates batches"""
    line_count = len(self.x)
    line_indices = list(range(line_count))

    if self.shuffle:
        random.shuffle(line_indices)
    index = 0

    while True:
        x_batch = [0] * self.batch_size
        y_batch = [0] * self.batch_size
        longest = 0
        for batch_index in range(self.batch_size):
            if index >= line_count:
                index = 0
                if self.shuffle:
                    random.shuffle(line_indices)

            x_batch[batch_index] = self.x[line_indices[index]]
            y_batch[batch_index] = self.y[line_indices[index]]

            longest = max(longest, len(x_batch[batch_index]))
            index += 1

        X = numpy.full((self.batch_size, longest), self.padding)
        Y = numpy.full((self.batch_size, longest), self.padding)

        for batch_index in range(self.batch_size): 
            line = x_batch[batch_index]
            label = y_batch[batch_index]

            for word in range(len(line)):
                X[batch_index, word] = line[word]
                Y[batch_index, word] = label[word]

        if self.verbose:
            print("index=", index)
        yield (X,Y)
    return    

The Generator Method

@property
def batch(self):
    """The instance of the generator"""
    if self._batch is None:
        self._batch = self.batch_generator()
    return self._batch

The Iterator Method

def __iter__(self):
    return self

The Next Method

def __next__(self) -> Batch:
    return next(self.batch)

Test It

from neurotic.nlp.named_entity_recognition import DataGenerator

generator = DataGenerator(x=ner.data.data_sets.x_train[0:8],
                          y=ner.data.data_sets.y_train[0: 8],
                          batch_size=5,
                          padding=vocabulary[TOKEN.pad])

X1, Y1 = next(generator)
X2, Y2 = next(generator)
print(Y1.shape, X1.shape, Y2.shape, X2.shape)
print(X1[0][:], "\n", Y1[0][:])
(5, 27) (5, 27) (5, 24) (5, 24)
[ 7848   538  5951  6187   172   502  2453     9   293    11  5306   822
   141  1962     7 26689  1176   686 11905 14806    11     9   292    21
 35178 35178 35178] 
 [    0     0     3    10     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
 35178 35178 35178]