Neural Machine Translation: The Data

The Data

This is the first post in a series that will look at creating a Long Short-Term Memory (LSTM) model with attention for Machine Translation. The previous post was an overview that holds the links to all the posts in the series.

Imports

# python
from pathlib import Path

import random

# pypi
from termcolor import colored

import numpy
import trax

Middle

Loading the Data

Next, we will import the dataset we will use to train the model. If you are running out of space, you can just use a small dataset from Opus, a growing collection of translated texts from the web. Particularly, we will get an English to German translation subset specified as opus/medical which has medical related texts. If storage is not an issue, you can opt to get a larger corpus such as the English to German translation dataset from ParaCrawl, a large multi-lingual translation dataset created by the European Union. Both of these datasets are available via Tensorflow Datasets (TFDS) and you can browse through the other available datasets here. As you'll see below, you can easily access this dataset from TFDS with trax.data.TFDS. The result is a python generator function yielding tuples. Use the keys argument to select what appears at which position in the tuple. For example, keys=('en', 'de') below will return pairs as (English sentence, German sentence).

The para_crawl/ende dataset is 4.04 GiB while the opus/medical dataset is 188.85 MiB.

Note: Trying to download the ParaCrawl dataset using trax creates an out of resource error. You can try downloading the source from:

https://s3.amazonaws.com/web-language-models/paracrawl/release4/en-de.bicleaner07.txt.gz

I haven't figured out how to get it into the trax data pipeline yet, though, so I'm sticking with the smaller dataset.

The Training Data

The first time you run this it will download the dataset; after that it will just load it from the local files.

path = Path("~/data/tensorflow/translation/").expanduser()

data_set = "opus/medical"
# data_set = "para_crawl/ende"

train_stream_fn = trax.data.TFDS(data_set,
                                 data_dir=path,
                                 keys=('en', 'de'),
                                 eval_holdout_size=0.01,
                                 train=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-fb62d04026f5> in <module>
      4 # data_set = "para_crawl/ende"
      5 
----> 6 train_stream_fn = trax.data.TFDS(data_set,
      7                                  data_dir=path,
      8                                  keys=('en', 'de'),

/usr/local/lib/python3.8/dist-packages/gin/config.py in gin_wrapper(*args, **kwargs)
   1067       scope_info = " in scope '{}'".format(scope_str) if scope_str else ''
   1068       err_str = err_str.format(name, fn_or_cls, scope_info)
-> 1069       utils.augment_exception_message_and_reraise(e, err_str)
   1070 
   1071   return gin_wrapper

/usr/local/lib/python3.8/dist-packages/gin/utils.py in augment_exception_message_and_reraise(exception, message)
     39   proxy = ExceptionProxy()
     40   ExceptionProxy.__qualname__ = type(exception).__qualname__
---> 41   raise proxy.with_traceback(exception.__traceback__) from None
     42 
     43 

/usr/local/lib/python3.8/dist-packages/gin/config.py in gin_wrapper(*args, **kwargs)
   1044 
   1045     try:
-> 1046       return fn(*new_args, **new_kwargs)
   1047     except Exception as e:  # pylint: disable=broad-except
   1048       err_str = ''

/usr/local/lib/python3.8/dist-packages/gin/config.py in gin_wrapper(*args, **kwargs)
   1067       scope_info = " in scope '{}'".format(scope_str) if scope_str else ''
   1068       err_str = err_str.format(name, fn_or_cls, scope_info)
-> 1069       utils.augment_exception_message_and_reraise(e, err_str)
   1070 
   1071   return gin_wrapper

/usr/local/lib/python3.8/dist-packages/gin/utils.py in augment_exception_message_and_reraise(exception, message)
     39   proxy = ExceptionProxy()
     40   ExceptionProxy.__qualname__ = type(exception).__qualname__
---> 41   raise proxy.with_traceback(exception.__traceback__) from None
     42 
     43 

/usr/local/lib/python3.8/dist-packages/gin/config.py in gin_wrapper(*args, **kwargs)
   1044 
   1045     try:
-> 1046       return fn(*new_args, **new_kwargs)
   1047     except Exception as e:  # pylint: disable=broad-except
   1048       err_str = ''

~/trax/trax/data/tf_inputs.py in TFDS(dataset_name, data_dir, tfds_preprocess_fn, keys, train, shuffle_train, host_id, n_hosts, eval_holdout_size)
    279   else:
    280     subsplit = None
--> 281   (train_data, eval_data, _) = _train_and_eval_dataset(
    282       dataset_name, data_dir, eval_holdout_size,
    283       train_shuffle_files=shuffle_train, subsplit=subsplit)

~/trax/trax/data/tf_inputs.py in _train_and_eval_dataset(dataset_name, data_dir, eval_holdout_size, train_shuffle_files, eval_shuffle_files, subsplit)
    224   if eval_holdout_examples > 0 or subsplit is not None:
    225     n_train = train_examples - eval_holdout_examples
--> 226     train_start = int(n_train * subsplit[0])
    227     train_end = int(n_train * subsplit[1])
    228     if train_end - train_start < 1:

TypeError: 'NoneType' object is not subscriptable
  In call to configurable 'TFDS' (<function TFDS at 0x7f960c527280>)
  In call to configurable 'TFDS' (<function TFDS at 0x7f960c526f70>)

The Evaluation Data

Since we already downloaded the data in the previous code-block, this will just load the evaluation set from the downloaded data.

eval_stream_fn = trax.data.TFDS('opus/medical',
                                data_dir=path,
                                keys=('en', 'de'),
                                eval_holdout_size=0.01,
                                train=False)

Notice that TFDS returns a generator function, not a generator. This is because in Python you cannot reset generators, so you cannot go back to a previously yielded value. During deep learning training you use Stochastic Gradient Descent and don't actually need to go back – but it is sometimes good to be able to do that, and that's where the functions come in. Let's print a sample pair from our train and eval data. Notice that the raw output is represented in bytes (denoted by the b' prefix); these will be converted to strings internally in the next steps.

train_stream = train_stream_fn()
print(colored('train data (en, de) tuple:', 'red'), next(train_stream))
print()
train data (en, de) tuple: (b'Tel: +421 2 57 103 777\n', b'Tel: +421 2 57 103 777\n')

eval_stream = eval_stream_fn()
print(colored('eval data (en, de) tuple:', 'red'), next(eval_stream))
eval data (en, de) tuple: (b'Lutropin alfa Subcutaneous use.\n', b'Pulver zur Injektion Lutropin alfa Subkutane Anwendung\n')
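
Since TFDS gave us functions rather than generators, we can get a fresh stream that starts over from the beginning whenever we want, just by calling the function again. A quick sketch (the exact pair you see first depends on how the data was shuffled):

# calling the function again produces a brand-new generator,
# starting over from the beginning of the (possibly shuffled) data
fresh_train_stream = train_stream_fn()
print(next(fresh_train_stream))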

Tokenization and Formatting

Now that we have imported our corpus, we will be preprocessing the sentences into a format that our model can accept. This will be composed of several steps:

Tokenizing the sentences using subword representations: We want to represent each sentence as an array of integers instead of strings. For our application, we will use subword representations to tokenize our sentences. This is a common technique to avoid out-of-vocabulary words by allowing parts of words to be represented separately. For example, instead of having separate entries in your vocabulary for "fear", "fearless", "fearsome", "some", and "less", you can simply store "fear", "some", and "less" and then allow your tokenizer to combine these subwords when needed. This makes it more flexible, so you won't have to save uncommon words explicitly in your vocabulary (e.g. stylebender, nonce, etc.). Tokenizing is done with the `trax.data.Tokenize()` command and we will use the combined subword vocabulary for English and German (i.e. `ende_32k.subword`) retrieved from https://storage.googleapis.com/trax-ml/vocabs/ende_32k.subword (I'm using the web-interface, but you could also just download it and put it in a directory).
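
To make the subword idea a bit more concrete, here is a toy illustration with a made-up three-entry vocabulary. This greedy splitter is not how the real ende_32k.subword encoder works internally; it just shows how whole words can be assembled from stored pieces.

# a toy subword vocabulary (made up for illustration)
subwords = {"fear": 0, "some": 1, "less": 2}

def toy_encode(word: str) -> list:
    # greedily split a word into known subword pieces
    pieces, rest = [], word
    while rest:
        for piece in sorted(subwords, key=len, reverse=True):
            if rest.startswith(piece):
                pieces.append(subwords[piece])
                rest = rest[len(piece):]
                break
        else:
            raise ValueError(f"can't encode {rest!r}")
    return pieces

print(toy_encode("fearless"))   # [0, 2]
print(toy_encode("fearsome"))   # [0, 1]

With the real vocabulary file, trax.data.Tokenize handles this for us.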

VOCAB_FILE = 'ende_32k.subword'
VOCAB_DIR = "gs://trax-ml/vocabs/" # google storage

# Tokenize the dataset.
tokenized_train_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(train_stream)
tokenized_eval_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(eval_stream)

Append an end-of-sentence token to each sentence: We will assign a token (i.e. in this case 1) to mark the end of a sentence. This will be useful in inference/prediction so we'll know that the model has completed the translation.

Integer assigned as end-of-sentence (EOS)

EOS = 1
def append_eos(stream):
    """helper to add end of sentence token to sentences in the stream

    Yields:
     next tuple of numpy arrays with EOS token added (inputs, targets)
    """
    for (inputs, targets) in stream:
        inputs_with_eos = list(inputs) + [EOS]
        targets_with_eos = list(targets) + [EOS]
        yield numpy.array(inputs_with_eos), numpy.array(targets_with_eos)
    return
tokenized_train_stream = append_eos(tokenized_train_stream)
tokenized_eval_stream = append_eos(tokenized_eval_stream)

Filter long sentences

We will place a limit on the number of tokens per sentence to ensure we won't run out of memory. This is done with the trax.data.FilterByLength() method and you can see its syntax below.

Filter sentences that are too long so we don't run out of memory. length_keys=[0, 1] means we filter on both the English and German sentences, so both must not be longer than 256 tokens for training and 512 tokens for evaluation.

filtered_train_stream = trax.data.FilterByLength(
    max_length=256, length_keys=[0, 1])(tokenized_train_stream)
filtered_eval_stream = trax.data.FilterByLength(
    max_length=512, length_keys=[0, 1])(tokenized_eval_stream)
train_input, train_target = next(filtered_train_stream)
print(colored(f'Single tokenized example input:', 'red' ), train_input)
print(colored(f'Single tokenized example target:', 'red'), train_target)
Single tokenized example input: [ 2538  2248    30 12114 23184 16889     5     2 20852  6456 20592  5812
  3932    96  5178  3851    30  7891  3550 30650  4729   992     1]
Single tokenized example target: [ 1872    11  3544    39  7019 17877 30432    23  6845    10 14222    47
  4004    18 21674     5 27467  9513   920   188 10630    18  3550 30650
  4729   992     1]

tokenize & detokenize helper functions

Given any data set, you have to be able to map words to their indices, and indices to their words. The inputs and outputs to your Trax models are usually tensors of numbers where each number corresponds to a word. If you were to process your data manually, you would have to make use of the following (there's a small sketch of this after the list):

  • word2Ind: a dictionary mapping the word to its index.
  • ind2Word: a dictionary mapping the index to its word.
  • word2Count: a dictionary mapping the word to the number of times it appears.
  • num_words: total number of words that have appeared.
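
Here's a minimal sketch of what building those mappings by hand might look like (a toy corpus, nothing to do with opus/medical):

# python
from collections import Counter

# a toy corpus: a list of already-tokenized sentences
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# word2Count: how many times each word appears
word2Count = Counter(word for sentence in sentences for word in sentence)

# word2Ind / ind2Word: mappings between words and integer indices
word2Ind = {word: index for index, word in enumerate(word2Count)}
ind2Word = {index: word for word, index in word2Ind.items()}

# num_words: the total number of distinct words seen
num_words = len(word2Ind)
print(word2Ind, num_words)

In practice Trax does this bookkeeping for us through the vocabulary file, so all we need are thin wrappers around trax.data.tokenize and trax.data.detokenize for single strings.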
def tokenize(input_str: str,
             vocab_file: str=None, vocab_dir: str=None, EOS: int=EOS) -> numpy.ndarray:
    """Encodes a string to an array of integers

    Args:
       input_str: human-readable string to encode
       vocab_file: filename of the vocabulary text file
       vocab_dir: path to the vocabulary file

    Returns:
       tokenized version of the input string
    """
    # Use the trax.data.tokenize method. It takes streams and returns streams,
    # we get around it by making a 1-element stream with `iter`.
    inputs =  next(trax.data.tokenize(iter([input_str]),
                                      vocab_file=vocab_file,
                                      vocab_dir=vocab_dir))

    # Mark the end of the sentence with EOS
    inputs = list(inputs) + [EOS]

    # Adding the batch dimension to the front of the shape
    batch_inputs = numpy.reshape(numpy.array(inputs), [1, -1])

    return batch_inputs
def detokenize(integers: numpy.ndarray,
               vocab_file: str=None,
               vocab_dir: str=None,
               EOS: int=EOS) -> str:
    """Decodes an array of integers to a human readable string

    Args:
       integers: array of integers to decode
       vocab_file: filename of the vocabulary text file
       vocab_dir: path to the vocabulary file

    Returns:
       str: the decoded sentence.
    """
    # Remove the dimensions of size 1
    integers = list(numpy.squeeze(integers))

    # Remove the EOS to decode only the original tokens
    if EOS in integers:
        integers = integers[:integers.index(EOS)] 

    return trax.data.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)

Let's see how we might use these functions:

Detokenize an input-target pair of tokenized sentences

print(colored(f'Single detokenized example input:', 'red'), detokenize(train_input, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f'Single detokenized example target:', 'red'), detokenize(train_target, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print()
Single detokenized example input: During treatment with olanzapine, adolescents gained significantly more weight compared with adults.

Single detokenized example target: Während der Behandlung mit Olanzapin nahmen die Jugendlichen im Vergleich zu Erwachsenen signifikant mehr Gewicht zu.

Tokenize and detokenize a word that is not explicitly saved in the vocabulary file. See how it combines the subwords 'hell' and 'o' to form the word 'hello'.

print(colored("tokenize('hello'): ", 'green'), tokenize('hello', vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored("detokenize([17332, 140, 1]): ", 'green'), detokenize([17332, 140, 1], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
tokenize('hello'):  [[17332   140     1]]
detokenize([17332, 140, 1]):  hello

Bucketing

Bucketing the tokenized sentences is an important technique used to speed up training in NLP. Here is a nice article describing it in detail but the gist is very simple. Our inputs have variable lengths and we want to make these the same when batching groups of sentences together. One way to do that is to pad each sentence to the length of the longest sentence in the dataset. This might lead to some wasted computation though. For example, if there are multiple short sentences with just two tokens, do we want to pad these when the longest sentence is composed of 100 tokens? Instead of padding with 0s to the maximum sentence length each time, we can group our tokenized sentences by length into buckets.

We batch the sentences with similar length together and only add minimal padding to make them have equal length (usually up to the nearest power of two). This allows us to waste less computation when processing padded sequences.
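
To see how much padding this can save, here's a toy back-of-the-envelope calculation (the sentence lengths are made up purely for illustration):

# four sentences with very different token counts
lengths = [2, 3, 5, 100]

# padding everything to the longest sentence wastes a lot of zeros
waste_pad_to_max = sum(max(lengths) - length for length in lengths)     # 290 padding tokens

# bucketing: the three short sentences pad up to 8, the long one up to 128
waste_bucketed = sum(8 - length for length in [2, 3, 5]) + (128 - 100)  # 42 padding tokens

print(waste_pad_to_max, waste_bucketed)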

In Trax, this is implemented by trax.data.BucketByLength.

Bucketing to create streams of batches.

Buckets are defined in terms of boundaries and batch sizes. batch_sizes[i] determines the batch size for items with length < boundaries[i]. So below, we'll take a batch of 256 sentences of length < 8, 128 if the length is between 8 and 16, and so on – and only 2 if the length is over 512. We'll do the bucketing with trax.data.BucketByLength.

boundaries = [2**power_of_two for power_of_two in range(3, 10)]
batch_sizes = [2**power_of_two for power_of_two in range(8, 0, -1)]
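
Just to make the pairing explicit, these are the values the two comprehensions produce, and how a sentence length maps to a batch size:

print("boundaries: ", boundaries)   # [8, 16, 32, 64, 128, 256, 512]
print("batch sizes:", batch_sizes)  # [256, 128, 64, 32, 16, 8, 4, 2]

# e.g. a sentence with 10 tokens falls in the 8 <= length < 16 bucket,
# so it will be batched 128 sentences at a time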

Create the generators.

train_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]  # As before: count inputs and targets to length.
)(filtered_train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]
)(filtered_eval_stream)

Add masking for the padding tokens (0s) using trax.data.AddLossWeights (its documentation just says "see add_loss_weights"). I can't find much documentation beyond that, but the 0s appear to be what BucketByLength uses for padding, so masking them keeps the padding from contributing to the loss.

train_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(eval_batch_stream)

Exploring the data

We will now display some of our data. You will see that the functions defined above (i.e. tokenize() and detokenize()) wrap up steps we have done before, which lets us focus on building the model itself. Let's first get the data generator and grab one batch of the data.

input_batch, target_batch, mask_batch = next(train_batch_stream)

Let's see the data type of a batch.

print("input_batch data type: ", type(input_batch))
print("target_batch data type: ", type(target_batch))
input_batch data type:  <class 'numpy.ndarray'>
target_batch data type:  <class 'numpy.ndarray'>

Let's see the shape of this particular batch (batch size, sentence length).

print("input_batch shape: ", input_batch.shape)
print("target_batch shape: ", target_batch.shape)
input_batch shape:  (32, 64)
target_batch shape:  (32, 64)
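
Since we told AddLossWeights to mask the id 0, the third element of the batch tuple should just mark the non-padding positions of the targets. A quick sanity check (assuming 0 really is the padding id):

# the mask should be 1 wherever the target is a real token and 0 on the padding
print("mask marks the non-padding positions:",
      bool((mask_batch == (target_batch != 0)).all()))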

The input_batch and target_batch are Numpy arrays consisting of tokenized English sentences and German sentences respectively. These tokens will later be used to produce embedding vectors for each word in the sentence (so the embedding for a sentence will be a matrix). The number of sentences in each batch is usually a power of 2 for optimal computer memory usage.

We can now visually inspect some of the data. You can run the cell below several times to shuffle through the sentences. Just to note, while this is a standard data set that is used widely, it does have some known wrong translations. With that, let's pick a random sentence and print its tokenized representation.

Pick a random index less than the batch size.

index = random.randrange(len(input_batch))

Use the index to grab an entry from the input and target batch.

print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'), detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THIS IS THE GERMAN TRANSLATION: \n', 'red'), detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: \n', 'red'), target_batch[index], '\n')
THIS IS THE ENGLISH SENTENCE: 
 Kidneys and urinary tract (no effects were found to be common); uncommon: blood in the urine, proteins in the urine, sugar in the urine; rare: urge to pass urine, kidney pain, passing urine frequently.
 

THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: 
  [ 5381 17607  3093     8  8670  6086   105 19166     5    50   154  1743
   152  1103     9    32   568  8076 19124  6847    64  6196     6     4
  8670   510     2 13355   823     6     4  8670   510     2  4968     6
     4  8670   510   115  7227    64  7628     9  2685  8670   510     2
 12220  5509 12095     2 19632  8670   510  7326  3550 30650  4729   992
     1     0     0     0] 

THIS IS THE GERMAN TRANSLATION: 
 Harndrang, Nierenschmerzen, häufiges Wasserlassen.
 

THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: 
 [ 5135 14970  2920     2  6262  4594 27552    28     2 20052    33  3736
   530  3550 30650  4729   992     1     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0] 

Bundle it Up

Imports

# python
from collections import namedtuple
from pathlib import Path

# pypi
import attr
import numpy
import trax

Constants

DataDefaults = namedtuple("DataDefaults",
                          ["path",
                           "dataset",
                           "keys",
                           "evaluation_size",
                           "end_of_sentence",
                           "vocabulary_file",
                           "vocabulary_path",
                           "length_keys",
                           "boundaries",
                           "batch_sizes",
                           "padding_token"])

DEFAULTS = DataDefaults(
    path=Path("~/data/tensorflow/translation/").expanduser(),
    dataset="opus/medical",
    keys=("en", "de"),
    evaluation_size=0.01,
    end_of_sentence=1,
    vocabulary_file="ende_32k.subword",
    vocabulary_path="gs://trax-ml/vocabs/",
    length_keys=[0, 1],
    boundaries=[2**power_of_two for power_of_two in range(3, 10)],
    batch_sizes=[2**power_of_two for power_of_two in range(8, 0, -1)],
    padding_token=0,
)

MaxLength = namedtuple("MaxLength", "train evaluate".split())
MAX_LENGTH = MaxLength(train=256, evaluate=512)
END_OF_SENTENCE = 1

Tokenizer/Detokenizer

Tokenizer

def tokenize(input_str: str,
             vocab_file: str=None, vocab_dir: str=None,
             end_of_sentence: int=DEFAULTS.end_of_sentence) -> numpy.ndarray:
    """Encodes a string to an array of integers

    Args:
       input_str: human-readable string to encode
       vocab_file: filename of the vocabulary text file
       vocab_dir: path to the vocabulary file
       end_of_sentence: token for the end of sentence
    Returns:
       tokenized version of the input string
    """
    # The trax.data.tokenize method takes streams and returns streams,
    # we get around it by making a 1-element stream with `iter`.
    inputs =  next(trax.data.tokenize(iter([input_str]),
                                      vocab_file=vocab_file,
                                      vocab_dir=vocab_dir))

    # Mark the end of the sentence with EOS
    inputs = list(inputs) + [end_of_sentence]

    # Adding the batch dimension to the front of the shape
    batch_inputs = numpy.reshape(numpy.array(inputs), [1, -1])
    return batch_inputs

Detokenizer

def detokenize(integers: numpy.ndarray,
               vocab_file: str=None,
               vocab_dir: str=None,
               end_of_sentence: int=DEFAULTS.end_of_sentence) -> str:
    """Decodes an array of integers to a human readable string

    Args:
       integers: array of integers to decode
       vocab_file: filename of the vocabulary text file
       vocab_dir: path to the vocabulary file
       end_of_sentence: token to mark the end of a sentence
    Returns:
       str: the decoded sentence.
    """
    # Remove the dimensions of size 1
    integers = list(numpy.squeeze(integers))

    # Remove the EOS to decode only the original tokens
    if end_of_sentence in integers:
        integers = integers[:integers.index(end_of_sentence)] 

    return trax.data.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)

Data Generator

@attr.s(auto_attribs=True)
class DataGenerator:
    """Generates the streams of data

    Args:
     training: whether this generates training data or not
     path: path to the data set
     data_set: name of the data set (from tensorflow datasets)
     keys: the names of the data
     max_length: longest allowed set of tokens
     evaluation_fraction: how much of the data is saved for evaluation
     length_keys: keys (indexes) to use when setting length
     boundaries: upper limits for batch sizes
     batch_sizes: batch_size for each boundary
     padding_token: which token is used for padding
     vocabulary_file: name of the sub-words vocabulary file
     vocabulary_path: where to find the vocabulary file
     end_of_sentence: token to indicate the end of a sentence
    """
    training: bool=True
    path: Path=DEFAULTS.path
    data_set: str=DEFAULTS.dataset
    keys: tuple=DEFAULTS.keys
    max_length: int=MAX_LENGTH.train
    length_keys: list=DEFAULTS.length_keys
    boundaries: list=DEFAULTS.boundaries
    batch_sizes: list=DEFAULTS.batch_sizes
    evaluation_fraction: float=DEFAULTS.evaluation_size
    vocabulary_file: str=DEFAULTS.vocabulary_file
    vocabulary_path: str=DEFAULTS.vocabulary_path
    padding_token: int=DEFAULTS.padding_token
    end_of_sentence: int=DEFAULTS.end_of_sentence
    _generator_function: type=None
    _batch_generator: type=None

Append End of Sentence

def end_of_sentence_generator(self, original):
    """Generator that adds end of sentence tokens

    Args:
     original: generator to add the end of sentence tokens to

    Yields:
     next tuple of arrays with EOS token added
    """
    for inputs, targets in original:
        inputs = list(inputs) + [self.end_of_sentence]
        targets = list(targets) + [self.end_of_sentence]
        yield numpy.array(inputs), numpy.array(targets)
    return 

Generator Function

@property
def generator_function(self):
    """Function to create the data generator"""
    if self._generator_function is None:
        self._generator_function = trax.data.TFDS(self.data_set,
                                                  data_dir=self.path,
                                                  keys=self.keys,
                                                  eval_holdout_size=self.evaluation_fraction,
                                                  train=self.training)
    return self._generator_function

Batch Stream

@property
def batch_generator(self):
    """batch data generator"""
    if self._batch_generator is None:
        generator = self.generator_function()
        generator = trax.data.Tokenize(
            vocab_file=self.vocabulary_file,
            vocab_dir=self.vocabulary_path)(generator)
        generator = self.end_of_sentence_generator(generator)
        generator = trax.data.FilterByLength(
            max_length=self.max_length,
            length_keys=self.length_keys)(generator)
        generator = trax.data.BucketByLength(
            self.boundaries, self.batch_sizes,
            length_keys=self.length_keys
        )(generator)
        self._batch_generator = trax.data.AddLossWeights(
            id_to_mask=self.padding_token)(generator)
    return self._batch_generator

Try It Out

from neurotic.nlp.machine_translation import DataGenerator, detokenize

generator = DataGenerator().batch_generator
input_batch, target_batch, mask_batch = next(generator)
index = random.randrange(len(input_batch))


print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'), detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THIS IS THE GERMAN TRANSLATION: \n', 'red'), detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: \n', 'red'), target_batch[index], '\n')
THIS IS THE ENGLISH SENTENCE: 
 Signs of hypersensitivity reactions include hives, generalised urticaria, tightness of the chest, wheezing, hypotension and anaphylaxis.
 

THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: 
  [10495    14     7 10224 19366 10991  1020  3481  2486     2  9547  7417
   103  4572 11927  9371     2 13197  1496     7     4 24489    62     2
 16402 24010   211     2  4814 23010 12122    22     8  4867 19606  6457
  5175    14  3550 30650  4729   992     1     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0] 

THIS IS THE GERMAN TRANSLATION: 
 Überempfindlichkeitsreaktionen können sich durch Anzeichen wie Nesselausschlag, generalisierte Urtikaria, Engegefühl im Brustkorb, Pfeifatmung, Blutdruckabfall und Anaphylaxie äußern.
 

THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: 
 [ 3916 29551 13504  5020  4094 13522   119    51   121  8602    93 31508
  6050 30327  6978     2  9547  7417  2446  5618  4581  5530  1384     2
 26006  7831 13651     5    47  8584  4076  5262   868     2 25389  8898
 28268     2  9208 29697 17944    83    12  9925 19606  6457 16384     5
 11790  3550 30650  4729   992     1     0     0     0     0     0     0
     0     0     0     0] 

End

Now that we have our data prepared it's time to move on to defining the Attention Model.

Neural Machine Translation

Neural Machine Translation

Here, we will build an English-to-German neural machine translation (NMT) model using Long Short-Term Memory (LSTM) networks with attention. Machine translation is an important task in natural language processing and could be useful not only for translating one language to another but also for word sense disambiguation (e.g. determining whether the word "bank" refers to the financial bank, or the land alongside a river). Implementing this using just a Recurrent Neural Network (RNN) with LSTMs can work for short to medium length sentences but can result in vanishing gradients for very long sequences. To solve this, we will be adding an attention mechanism to allow the decoder to access all relevant parts of the input sentence regardless of its length. Over the course of this series, we will:

  • learn how to preprocess your training and evaluation data
  • implement an encoder-decoder system with attention
  • understand how attention works
  • build the NMT model from scratch using Trax
  • generate translations using greedy and Minimum Bayes Risk (MBR) decoding

The Posts

This will be broken up into the following posts.

First - a look at the data.

Stack Semantics

Stack Semantics in Trax

This will help in understanding how to use layers like Select and Residual, which operate on elements in the stack. If you've taken a computer science class before, you will recall that a stack is a data structure that follows the Last In, First Out (LIFO) principle. That is, whatever is the latest element that is pushed into the stack will also be the first one to be popped out. If you're not yet familiar with stacks, then you may find this short tutorial useful. In a nutshell, all you really need to remember is that it puts elements one on top of the other. You should be aware of what is on top of the stack to know which element you will be popping.
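
If it helps, a plain Python list already behaves like a stack; this little sketch just illustrates the Last In, First Out idea and has nothing to do with Trax yet.

stack = []
stack.append(3)      # push 3
stack.append(4)      # push 4; 4 is now on top
print(stack.pop())   # 4 - the last element in is the first one out
print(stack.pop())   # 3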

Imports

# pypi
import numpy
from trax import fastmath, layers, shapes

Middle

The Serial Combinator is Stack Oriented.

Most of the time in Trax you will be using the Serial combinator, and it is stack-oriented. To see how this works, we will define two simple Fn layers:

  1. Addition
  2. Multiplication

Suppose we want to make the simple calculation \((3 + 4) \times 15 + 3\). We'll use Serial to perform the calculations in the following order 3 4 add 15 mul 3 add. The steps of the calculation are shown in the table below.

Stack Operations Stack
Push(4) 4
Push(3) 4 3
Push(Add Pop() Pop()) 7
Push(15) 7 15
Push(Mul Pop() Pop()) 105
Push(3) 105 3
Push(Add() Pop() Pop()) 108

The first column shows the operations made on the stack and the second column is what's on the stack. Moreover, the rightmost element in the second column represents the top of the stack (e.g. in the second row, Push(3) pushes 3 on top of the stack and 4 is now under it).

After finishing the steps the stack contains 108 which is the answer to our simple computation.

From this, the following can be concluded: a stack-based layer has only one way to handle data: taking one piece of data from atop the stack (popping) and putting data back atop the stack (pushing). Any expression that can be written conventionally can be written this way and is thus amenable to being interpreted by a stack-oriented layer like Serial.
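
As a sanity check on the table, here is the same postfix sequence (3 4 add 15 mul 3 add) evaluated with a plain Python list standing in for the stack. This is just a sketch of the idea, not Trax code.

stack = []

def push(value):
    stack.append(value)

def pop():
    return stack.pop()

push(4)
push(3)
push(pop() + pop())   # Add: 3 + 4 = 7
push(15)
push(pop() * pop())   # Mul: 15 * 7 = 105
push(3)
push(pop() + pop())   # Add: 3 + 105 = 108
print(stack)          # [108]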

Defining addition

We're going to define a Trax Fn layer for addition.

def Addition():
    layer_name = "Addition" 

    def func(x, y):
        return x + y

    return layers.Fn(layer_name, func)

Test it out.

add = Addition()
print(type(add))
<class 'trax.layers.base.PureLayer'>
print("name :", add.name)
print("expected inputs :", add.n_in)
print("promised outputs :", add.n_out)
name : Addition
expected inputs : 2
promised outputs : 1
x = numpy.array([3])
y = numpy.array([4])

print(f"{x} + {y} = {add((x, y))}")
[3] + [4] = [7]

Defining multiplication

def Multiplication():
    layer_name = "Multiplication"

    def func(x, y):
        return x * y

    return layers.Fn(layer_name, func)

Test it out.

mul = Multiplication()

The properties.

print("name :", mul.name)
print("expected inputs :", mul.n_in)
print("promised outputs :", mul.n_out, "\n")
name : Multiplication
expected inputs : 2
promised outputs : 1 

Some Inputs.

x = numpy.array([7])
y = numpy.array([15])
print("x :", x)
print("y :", y)
x : [7]
y : [15]

The Output

z = mul((x, y))
print(f"{x} * {y} = {mul((x, y))}")
[7] * [15] = [105]

Implementing the computations using the Serial combinator

serial = layers.Serial(
    Addition(), Multiplication(), Addition()
)
inputs = (numpy.array([3]), numpy.array([4]), numpy.array([15]), numpy.array([3]))

serial.init(shapes.signature(inputs))
print(serial, "\n")
print("name :", serial.name)
print("sublayers :", serial.sublayers)
print("expected inputs :", serial.n_in)
print("promised outputs :", serial.n_out, "\n")
Serial_in4[
  Addition_in2
  Multiplication_in2
  Addition_in2
] 

name : Serial
sublayers : [Addition_in2, Multiplication_in2, Addition_in2]
expected inputs : 4
promised outputs : 1 
print(f"{inputs} -> {serial(inputs)}")
(array([3]), array([4]), array([15]), array([3])) -> [108]

This example, with the two simple addition and multiplication functions composed by the Serial combinator, shows how stack semantics work in Trax.

The tl.Select combinator in the context of the Serial combinator

Having understood how stack semantics work in Trax, we will demonstrate how the tl.Select combinator works.

First example of tl.Select

Suppose we want to make the simple calculation \((3 + 4) \times 3 + 4\). We can use Select to perform the calculations in the following manner:

  1. input 3 4
  2. tl.Select([0, 1, 0, 1])
  3. add
  4. mul
  5. add.

tl.Select requires a list or tuple of 0-based indices to select elements relative to the top of the stack. For our example, the top of the stack is 3 (which is at index 0), then 4 (index 1), and we use Select to copy the top two elements of the stack and push all four elements back onto the stack, which after the command executes will contain 3 4 3 4. The steps of the calculation for our example are shown in the table below. As in the previous table, each row shows the operation and the contents of the stack after it is carried out.

Stack Operations Stack
Push(4) 4
Push(3) 4 3
Push(Select([0, 1, 0, 1])) 4 3 4 3
Push(Add Pop() Pop()) 4 3 7
Push(Mul Pop() Pop()) 4 21
Push(Add Pop() Pop()) 25

After processing all the inputs the stack contains 25 which is the result of the calculations.
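
Here is a rough plain-Python sketch of the bookkeeping that Select([0, 1, 0, 1]) does in the third row of the table, treating the right end of the list as the top of the stack. Again, this is just an illustration of the idea, not what Trax does internally.

stack = [4, 3]          # right end is the top, so 3 is at index 0 and 4 at index 1
indices = [0, 1, 0, 1]

# read the selected elements relative to the top of the stack
picked = [stack[-1 - index] for index in indices]   # [3, 4, 3, 4]

# Select consumes the two elements it read from (n_in=2) and pushes the
# picked elements back, with the first pick ending up on top
stack = stack[:-2] + picked[::-1]
print(stack)            # [4, 3, 4, 3]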

serial = layers.Serial(
    layers.Select([0, 1, 0, 1]),
    Addition(),
    Multiplication(),
    Addition()
)

Now we'll create the input.

x = (numpy.array([3]), numpy.array([4]))
serial.init(shapes.signature(x))
print(serial, "\n")
print("name :", serial.name)
print("sublayers :", serial.sublayers)
print("expected inputs :", serial.n_in)
print("promised outputs :", serial.n_out, "\n")
Serial_in2[
  Select[0,1,0,1]_in2_out4
  Addition_in2
  Multiplication_in2
  Addition_in2
] 

name : Serial
sublayers : [Select[0,1,0,1]_in2_out4, Addition_in2, Multiplication_in2, Addition_in2]
expected inputs : 2
promised outputs : 1 
print(f"{x} -> {serial(x)}")
(array([3]), array([4])) -> [25]

Select Makes It More Like a Collection

Note that since you are passing in indices to Select, you aren't really using it like a stack, even if behind the scenes it's using push and pop.

serial = layers.Serial(
    layers.Select([2, 1, 1, 2]),
    Addition(),
    Multiplication(),
    Addition()
)

x = (numpy.array([3]), numpy.array([4]), numpy.array([5]))
serial.init(shapes.signature(x))

print(f"{x} -> {serial(x)}")
(array([3]), array([4]), array([5])) -> [41]
print((5 + 4) * 4 + 5)
41

Another example of tl.Select

Suppose we want to make the simple calculation \((3 + 4) \times 4\). We can use Select to perform the calculations in the following manner:

  1. 4
  2. 3
  3. tl.Select([0,1,0,1])
  4. add
  5. tl.Select([0], n_in=2)
  6. mul

The example is a bit contrived but it demonstrates the flexibility of the command. The second tl.Select pops two elements (specified by n_in) from the stack, starting from index 0 (i.e. the top of the stack). This means that 7 and 3 will be popped out (because n_in=2) but only 7 is placed back on top, because it only selects [0]. As in the previous table, each row shows the operation and the contents of the stack after it is carried out.

Stack Operations Stack
Push(4) 4
Push(3) 4 3
Push(Select([0, 1, 0, 1])) 4 3 4 3
Push(Add Pop() Pop()) 4 3 7
Push(Select([0], n_in=2)) 4 7
Push(Mul Pop() Pop()) 28

After processing all the inputs the stack contains 28, which is the answer to the calculation above.

serial = layers.Serial(
    layers.Select([0, 1, 0, 1]),
    Addition(),
    layers.Select([0], n_in=2),
    Multiplication()
)
inputs = (numpy.array([3]), numpy.array([4]))
serial.init(shapes.signature(inputs))
print(serial, "\n")
print("name :", serial.name)
print("sublayers :", serial.sublayers)
print("expected inputs :", serial.n_in)
print("promised outputs :", serial.n_out)
Serial_in2[
  Select[0,1,0,1]_in2_out4
  Addition_in2
  Select[0]_in2
  Multiplication_in2
] 

name : Serial
sublayers : [Select[0,1,0,1]_in2_out4, Addition_in2, Select[0]_in2, Multiplication_in2]
expected inputs : 2
promised outputs : 1
print(f"{inputs} -> {serial(inputs)}")
(array([3]), array([4])) -> [28]

In summary, what Select does in this example is make a copy of the inputs in order to be used further along in the stack of operations.

The tl.Residual combinator in the context of the Serial combinator

tl.Residual

Residual networks (that link is to the research paper, this one is to Wikipedia) are frequently used to make deep models easier to train. Trax already has a built-in layer for this. The Residual layer computes the element-wise sum of the stack-top input with the output of the layer series. Let's first see how it is used in the code below:

serial = layers.Serial(
    layers.Select([0, 1, 0, 1]),
    layers.Residual(Addition())
)

print(serial, "\n")
print("name :", serial.name)
print("expected inputs :", serial.n_in)
print("promised outputs :", serial.n_out)
Serial_in2_out3[
  Select[0,1,0,1]_in2_out4
  Serial_in2[
    Branch_in2_out2[
      None
      Addition_in2
    ]
    Add_in2
  ]
] 

name : Serial
expected inputs : 2
promised outputs : 3

Here, we use the Serial combinator to define our model. The inputs first go through a Select layer, followed by a Residual layer which takes the Fn: Addition() layer as an argument. What this means is that the Residual layer will take the stack-top input at that point and add it to the output of the Fn: Addition() layer. You can picture it like the diagram below, where x1 and x2 are the inputs to the model:

Now, let's try running our model with some sample inputs and see the result:

x1 = numpy.array([3])
x2 = numpy.array([4])

print(f"{x1} + {x2} -> {serial((x1, x2))}")
[3] + [4] -> (array([10]), array([3]), array([4]))

As you can see, the Residual layer remembers the stack-top input (i.e. 3) and adds it to the result of the Fn: Addition() layer (i.e. 3 + 4 = 7). The output of Residual(Addition()) is then 3 + 7 = 10 and is pushed onto the stack.

On a different note, you'll notice that the Select layer has 4 outputs but the Fn: Addition() layer only pops 2 inputs from the stack. This means the duplicate inputs (i.e. the 2 rightmost arrows of the Select outputs in the figure above) remain in the stack. This is why you still see them in the output of our simple serial network (i.e. array([3]), array([4])). This is useful if you want to use these duplicate inputs in another layer further down the network.

Modifying the network

To strengthen your understanding, you can modify the network above and examine the outputs you get. For example, you can pass the Fn: Multiplication() layer instead in the Residual block:

serial = layers.Serial(
    layers.Select([0, 1, 0, 1]), 
    layers.Residual(Multiplication())
)

print(serial, "\n")
print("name :", serial.name)
print("expected inputs :", serial.n_in)
print("promised outputs :", serial.n_out)
Serial_in2_out3[
  Select[0,1,0,1]_in2_out4
  Serial_in2[
    Branch_in2_out2[
      None
      Multiplication_in2
    ]
    Add_in2
  ]
] 

name : Serial
expected inputs : 2
promised outputs : 3

This means you'll have a different output that will be added to the stack top input saved by the Residual block. The diagram becomes like this:

And you'll get 3 + (3 * 4) = 15 as output of the Residual block:

x1 = numpy.array([3])
x2 = numpy.array([4])

y = serial((x1, x2))
print(f"{x1} * {x2} -> {serial((x1, x2))}")
[3] * [4] -> (array([15]), array([3]), array([4]))

Bleu Score

Calculating the Bilingual Evaluation Understudy (BLEU) score

We will be implementing a popular metric for evaluating the quality of machine-translated text: the BLEU score, proposed by Kishore Papineni, et al. in their 2002 paper "BLEU: a Method for Automatic Evaluation of Machine Translation". The BLEU score works by comparing "candidate" text to one or more "reference" translations; the closer the score is to 1, the better the translation. Let's see how to get this value in the following sections.

Imports

# python
from collections import Counter, namedtuple
from functools import partial
from pathlib import Path

import math
import os

# from pypi
from dotenv import load_dotenv
from nltk.util import ngrams

import hvplot.pandas
import numpy
import nltk
import sacrebleu
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

nltk.download('punkt')
slug = "bleu-score"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{slug}")

Plot = namedtuple("Plot", ["width", "height", "fontscale", "tan", "blue", "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

load_dotenv("posts/nlp/.env", override=True)

Middle

Part 1: BLEU Score

We will implement our own version of the BLEU Score using Numpy. To verify that our implementation is correct, we will compare our results with those generated by the SacreBLEU library. This package provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. It also knows all the standard test sets and handles downloading, processing, and tokenization.

Defining the BLEU Score

We can express the BLEU score as:

\[ BLEU = BP\left(\prod_{i=1}^{4}precision_i\right)^{(1/4)} \]

with the Brevity Penalty and precision defined as:

\[ BP = min\left(1, e^{(1-(\textit{reference}/\textit{candidate}))}\right) \]

\[ precision_i = \frac {\sum_{snt \in{cand}}\sum_{i\in{snt}}min\Bigl(m^{i}_{cand}, m^{i}_{ref}\Bigr)}{w^{i}_{t}} \]

where:

  • \(m^{i}_{cand}\), is the count of i-gram in candidate matching the reference translation.
  • \(m^{i}_{ref}\), is the count of i-gram in the reference translation.
  • \(w^{i}_{t}\), is the total number of i-grams in candidate translation.
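
To make the clipping in the precision term concrete, here is the classic unigram example of a degenerate candidate that just repeats a word from the reference (a small sketch using Counter):

from collections import Counter

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()

candidate_counts = Counter(candidate)   # 'the' appears 7 times
reference_counts = Counter(reference)   # 'the' appears 2 times

# each candidate n-gram count is clipped to its count in the reference
clipped = {word: min(count, reference_counts[word])
           for word, count in candidate_counts.items()}

precision_1 = sum(clipped.values()) / sum(candidate_counts.values())
print(precision_1)   # 2/7, instead of the 7/7 an unclipped precision would give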

Explaining the BLEU score

  • Brevity Penalty (example)
    ref_length = numpy.ones(100)
    can_length = numpy.linspace(1.5, 0.5, 100)
    x = ref_length / can_length
    y = 1 - x
    y = numpy.exp(y)
    brevity_penalty = numpy.minimum(numpy.ones(y.shape), y)
    
    frame = pandas.DataFrame.from_dict({"Reference Length/Candidate Length": x
                                        , "Brevity Penalty": brevity_penalty})
    plot = frame.hvplot(x="Reference Length/Candidate Length",
                        y="Brevity Penalty", title="Brevity Penalty").opts(
        width=PLOT.width,
        height=PLOT.height,
        fontscale=PLOT.fontscale    
    )
    output = Embed(plot=plot, file_name="brevity_penalty")()
    
    print(output)
    

    Figure Missing

    The brevity penalty penalizes generated translations that are too short compared to the closest reference length with an exponential decay. The brevity penalty compensates for the fact that the BLEU score has no recall term.

N-Gram Precision (example)

And now for a meaningless plot.

data = pandas.DataFrame.from_dict({"1-gram": [0.8],
                                   "2-gram": [0.7],
                                   "3-gram": [0.6],
                                   "4-gram": [0.5]})
plot = data.hvplot.bar(title="N-Gram Precision").opts(
    width=PLOT.width,
    height=PLOT.height,
    fontscale=PLOT.fontscale,
)

output = Embed(plot=plot, file_name="n_gram_precision")()
print(output)

Figure Missing

The n-gram precision counts how many unigrams, bigrams, trigrams, and four-grams (i=1,…,4) match their n-gram counterpart in the reference translations. This term acts as a precision metric. Unigrams account for adequacy while longer n-grams account for fluency of the translation. To avoid overcounting, the n-gram counts are clipped to the maximal n-gram count occurring in the reference (\(m_{n}^{ref}\)). Typically precision shows exponential decay with the degree of the n-gram.

N-gram BLEU score (example):

Another meaningless plot.

data = pandas.DataFrame.from_dict({"1-gram": [0.8],
                                   "2-gram": [0.77],
                                   "3-gram": [0.74],
                                   "4-gram": [0.71]})
plot = data.hvplot.bar(title="Modified N-Gram Precision").opts(
    width=PLOT.width,
    height=PLOT.height,
    fontscale=PLOT.fontscale
)

output = Embed(plot=plot, file_name="modified_n_gram_precision")()
print(output)

Figure Missing

When the n-gram precision is multiplied by the BP, then the exponential decay of n-grams is almost fully compensated. The BLEU score corresponds to a geometric average of this modified n-gram precision.

Example Calculations of the BLEU score

In this example we will have a reference translation and 2 candidate translations. We will tokenize all sentences using NLTK.

  • Step 1: Computing the Brevity Penalty
    def brevity_penalty(candidate: list, reference: list) -> numpy.ndarray:
        """Calculates the brevity penalty"""
        reference_length = len(reference)
        candidate_length = len(candidate)
    
        # Brevity Penalty
        return 1 if reference_length < candidate_length else numpy.exp( 1 - (reference_length / candidate_length))
    
  • Step 2: Computing the Precision
    def clipped_precision(candidate: list, reference: list) -> numpy.ndarray:
        """
        Clipped precision function given an original and a machine-translated sentence
        """
        clipped_precision_score = []
    
        for i in range(1, 5):
            ref_n_gram = Counter(ngrams(reference,i))
            cand_n_gram = Counter(ngrams(candidate,i))
    
            c = sum(cand_n_gram.values())
    
            for j in cand_n_gram: # for every n-gram up to 4 in candidate text
                if j in ref_n_gram: # check if it is in the reference n-gram
                    if cand_n_gram[j] > ref_n_gram[j]: # if the count of the candidate n-gram is bigger
                                                       # than the corresponding count in the reference n-gram,
                        cand_n_gram[j] = ref_n_gram[j] # then set the count of the candidate n-gram to be equal
                                                       # to the reference n-gram
                else:
                    cand_n_gram[j] = 0 # else set the candidate n-gram equal to zero
    
            clipped_precision_score.append(sum(cand_n_gram.values())/c)
    
        weights =[0.25] * 4
    
        s = (w_i * math.log(p_i) for w_i, p_i in zip(weights, clipped_precision_score))
        s = math.exp(math.fsum(s))
        return s
    
  • Step 3: Computing the BLEU score
    def bleu_score(candidate: list, reference: list) -> numpy.ndarray:
        BP = brevity_penalty(candidate, reference)
        precision = clipped_precision(candidate, reference)
        return BP * precision
    
  • Step 4: Testing with our Example Reference and Candidates Sentences
    reference = "The NASA Opportunity rover is battling a massive dust storm on planet Mars."
    candidate_1 = "The Opportunity rover is combating a big sandstorm on planet Mars."
    candidate_2 = "A NASA rover is fighting a massive storm on planet Mars."
    
    tokenized_ref = nltk.word_tokenize(reference.lower())
    tokenized_cand_1 = nltk.word_tokenize(candidate_1.lower())
    tokenized_cand_2 = nltk.word_tokenize(candidate_2.lower())
    
    print(
        "Results reference versus candidate 1 our own code BLEU: ",
        round(bleu_score(tokenized_cand_1, tokenized_ref) * 100, 1),
    )
    
    Results reference versus candidate 1 our own code BLEU:  27.6
    
    print(
        "Results reference versus candidate 2 our own code BLEU: ",
        round(bleu_score(tokenized_cand_2, tokenized_ref) * 100, 1),
    )
    
    Results reference versus candidate 2 our own code BLEU:  35.3
    
  • Step 5: Comparing the Results from our Code with the SacreBLEU Library
    print(
        "Results reference versus candidate 1 sacrebleu library BLEU: ",
        round(sacrebleu.corpus_bleu(candidate_1, reference).score, 1),
    )
    
    Results reference versus candidate 1 sacrebleu library BLEU:  27.6
    
    print(
        "Results reference versus candidate 2 sacrebleu library BLEU: ",
        round(sacrebleu.corpus_bleu(candidate_2, reference).score, 1),
    )
    
    Results reference versus candidate 2 sacrebleu library BLEU:  35.3
    

Part 2: BLEU computation on a corpus

Loading Data Sets for Evaluation Using the BLEU Score

In this section, we will show a simple pipeline for evaluating machine translated text. Due to storage and speed constraints, we will not be using our own model in this lab. Instead, we will be using Google Translate to generate English to German translations and we will evaluate it against a known evaluation set. There are three files we will need:

  1. A source text in English. In this lab, we will use the first 1671 words of the wmt19 evaluation dataset downloaded via SacreBLEU. We just grabbed a subset because of limitations in the number of words that can be translated using Google Translate.
  2. A reference translation to German of the corresponding first 1671 words from the original English text. This is also provided by SacreBLEU.
  3. A candidate machine translation to German from the same 1671 words. This is generated by feeding the source text to a machine translation model. As mentioned above, we will use Google Translate to generate the translations in this file.

With that, we can now compare the reference and candidate translations to get the BLEU Score.

Load the raw data.

with Path(os.environ["WMT19_SOURCE"]).expanduser().open(encoding="utf-8") as reader:
    wmt19_src_1 = reader.read()

with Path(os.environ["WMT19_REFERENCE"]).expanduser().open(encoding="utf-8") as reader:
    wmt19_ref_1 = reader.read()

with Path(os.environ["WMT19_CANDIDATE"]).expanduser().open(encoding="utf-8") as reader:
    wmt19_can_1 = reader.read()

tokenized_corpus_src = nltk.word_tokenize(wmt19_src_1.lower())
tokenized_corpus_ref = nltk.word_tokenize(wmt19_ref_1.lower())
tokenized_corpus_cand = nltk.word_tokenize(wmt19_can_1.lower())    

Inspecting the first sentence of the data.

print("English source text:\n")
print(f"{wmt19_src_1[0:170]} -> {tokenized_corpus_src[0:30]}\n")
print("German reference translation:\n")
print(f"{wmt19_ref_1[0:219]} -> {tokenized_corpus_ref[0:35]}\n")
print("German machine translation:\n")
print(f"{wmt19_can_1[0:199]} -> {tokenized_corpus_cand[0:29]}")
English source text:

Welsh AMs worried about 'looking like muppets'
There is consternation among some AMs at a suggestion their title should change to MWPs (Member of the Welsh Parliament).
 -> ['\ufeffwelsh', 'ams', 'worried', 'about', "'looking", 'like', "muppets'", 'there', 'is', 'consternation', 'among', 'some', 'ams', 'at', 'a', 'suggestion', 'their', 'title', 'should', 'change', 'to', 'mwps', '(', 'member', 'of', 'the', 'welsh', 'parliament', ')', '.']

German reference translation:

Walisische Ageordnete sorgen sich "wie Dödel auszusehen"
Es herrscht Bestürzung unter einigen Mitgliedern der Versammlung über einen Vorschlag, der ihren Titel zu MWPs (Mitglied der walisischen Parlament) ändern soll.
 -> ['\ufeffwalisische', 'ageordnete', 'sorgen', 'sich', '``', 'wie', 'dödel', 'auszusehen', "''", 'es', 'herrscht', 'bestürzung', 'unter', 'einigen', 'mitgliedern', 'der', 'versammlung', 'über', 'einen', 'vorschlag', ',', 'der', 'ihren', 'titel', 'zu', 'mwps', '(', 'mitglied', 'der', 'walisischen', 'parlament', ')', 'ändern', 'soll', '.']

German machine translation:

Walisische AMs machten sich Sorgen, dass sie wie Muppets aussehen könnten
Einige AMs sind bestürzt über den Vorschlag, ihren Titel in MWPs (Mitglied des walisischen Parlaments) zu ändern.
Es ist aufg -> ['walisische', 'ams', 'machten', 'sich', 'sorgen', ',', 'dass', 'sie', 'wie', 'muppets', 'aussehen', 'könnten', 'einige', 'ams', 'sind', 'bestürzt', 'über', 'den', 'vorschlag', ',', 'ihren', 'titel', 'in', 'mwps', '(', 'mitglied', 'des', 'walisischen', 'parlaments']
print(
    "Results reference versus candidate 1 our own BLEU implementation: ",
    round(bleu_score(tokenized_corpus_cand, tokenized_corpus_ref) * 100, 1),
)
Results reference versus candidate 1 our own BLEU implementation:  43.6
print(
    "Results reference versus candidate 1 sacrebleu library BLEU: ",
    round(sacrebleu.corpus_bleu(wmt19_can_1, wmt19_ref_1).score, 1),
)
Results reference versus candidate 1 sacrebleu library BLEU:  43.2

BLEU Score Interpretation on a Corpus

Score Interpretation
< 10 Almost useless
10 - 19 Hard to get the gist
20 - 29 The gist is clear, but has significant grammatical errors
30 - 40 Understandable to good translations
40 - 50 High quality translations
50 - 60 Very high quality, adequate, and fluent translations
> 60 Quality often better than human

From the table above (taken from here), we can see that the translation is high quality. Moreover, the results of our coded BLEU score are almost identical to those of the SacreBLEU package.

Siamese Networks: New Questions

Trying New Questions

Imports

# python
from pathlib import Path

# pypi
import nltk
import numpy
import pandas
import trax

# this project
from neurotic.nlp.siamese_networks import (
    DataGenerator,
    DataLoader,
    SiameseModel,
    TOKENS,
 )

Set Up

The Data

data_generator = DataGenerator
loader = DataLoader()
vocabulary = loader.vocabulary

The Model

siamese = SiameseModel(len(vocabulary))
path = Path("~/models/siamese_networks/model.pkl.gz").expanduser()
weights = siamese.model.init_from_file(path, weights_only=True)
model = siamese.model

Implementing It

Write a function predict that takes in two questions, the model, and the vocabulary and returns whether the questions are duplicates (1) or not duplicates (0) given a similarity threshold.

Instructions:

  • Tokenize your question using `nltk.word_tokenize`
  • Create Q1,Q2 by encoding your questions as a list of numbers using vocab
  • pad Q1,Q2 with next(data_generator([Q1], [Q2],1,vocab['<PAD>']))
  • use model() to create v1, v2
  • compute the cosine similarity (dot product) of v1, v2
  • compare the similarity to the threshold to decide whether the questions are duplicates
def predict(question1: str, question2: str,
            threshold: float=0.7, model: trax.layers.Parallel=model,
            vocab: dict=vocabulary, data_generator: type=data_generator,
            verbose: bool=True) -> bool:
    """Function for predicting if two questions are duplicates.

    Args:
       question1 (str): First question.
       question2 (str): Second question.
       threshold (float): Desired threshold.
       model (trax.layers.combinators.Parallel): The Siamese model.
       vocab (collections.defaultdict): The vocabulary used.
       data_generator (function): Data generator function. Defaults to data_generator.
       verbose (bool, optional): If the results should be printed out. Defaults to True.

    Returns:
       bool: True if the questions are duplicates, False otherwise.
    """
    question_one = [[vocab[word] for word in nltk.word_tokenize(question1)]]
    question_two = [[vocab[word] for word in nltk.word_tokenize(question2)]]

    questions = next(data_generator(question_one,
                                    question_two,
                                    batch_size=1))
    vector_1, vector_2 = model(questions)
    similarity = float(numpy.dot(vector_1, vector_2.T))
    same_question = similarity > threshold

    if verbose:
        print(f"Q1  = {questions[0]}")
        print(f"Q2 = {questions[1]}")
        print(f"Similarity : {float(similarity):0.2f}")
        print(f"They are the same question: {same_question}")
    return same_question

Some Trials

print(TOKENS)
Tokens(unknown=0, padding=1, padding_token='<PAD>')

So if we see a 0 in the tokens then we know the word wasn't in the vocabulary.
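For example, we can look a couple of words up directly (a quick check; this assumes the vocabulary falls back to TOKENS.unknown for words it has never seen, which is how the DataLoader sets it up):

# "When" shows up in the trial questions below and gets a non-zero id,
# while "Lancashire" (also used below) was never seen, so it falls back to 0
print(vocabulary.get("When", TOKENS.unknown))
print(vocabulary.get("Lancashire", TOKENS.unknown))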

question1 = "When will I see you?"
question2 = "When can I see you again?"
# 1 means it is duplicated, 0 otherwise
predict(question1 , question2, 0.7, model, vocabulary, verbose = True)
Q1  = [[581  64  20  44  49  16   1   1]]
Q2 = [[ 581   39   20   44   49 7280   16    1]]
Similarity : 0.95
They are the same question: True
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"

predict(question1 , question2, 0.7, model, vocabulary, verbose=True)
Q1  = [[  446  1138  3159  1169    70 29016    16     1]]
Q2 = [[  446  1138    57 15302    24    70  7430    16]]
Similarity : 0.60
They are the same question: False
predict("Do cows have butts?", "Do dogs have bones?")
Q1  = [[  446  5757   216 25442    16     1     1     1]]
Q2 = [[  446   788   216 11192    16     1     1     1]]
Similarity : 0.25
They are the same question: False
predict("Do cows from Lancashire have butts?", "Do dogs have bones as big as whales?")
Q1  = [[  446  5757   125     0   216 25442    16     1     1     1     1     1
      1     1     1     1]]
Q2 = [[  446   788   216 11192   249  1124   249 30836    16     1     1     1
      1     1     1     1]]
Similarity : 0.13
They are the same question: False
predict("Can pigs fly?", "Are you my mother?")
Q1  = [[  221 14137  5750    16     1     1     1     1]]
Q2 = [[ 517   49   41 1585   16    1    1    1]]
Similarity : 0.01
They are the same question: False
predict("Shall we dance?", "Shall I fart?")
Q1  = [[19382   138  4201    16]]
Q2 = [[19382    20 18288    16]]
Similarity : 0.71
They are the same question: True

Hm… surprising that "fart" was in the data set, and that the model considers it the same question as dancing.

farts = loader.training_data[loader.training_data.question2.str.contains("fart[^a-z]")]
print(len(farts))
print(farts.question2.head())
16
19820                                    Can penguins fart?
60745       How do I control a fart when I'm about to fart?
83124           What word square starts with the word fart?
96707         Which part of human body is called fart pump?
120727    Why do people fart more when they wake up in t...
Name: question2, dtype: object

Maybe I shouldn't have been surprised.

predict("Am I man or gorilla?", "Am I able to eat the pasta?")
Q1  = [[4311   20 1215   75 7438   16    1    1]]
Q2 = [[ 4311    20   461    37   922    70 14552    16]]
Similarity : 0.20
They are the same question: False

It looks like the model only looks at the first words… at least when the sentences are short.

predict("Will we return to Mars or go instead to Venus?", "Will we eat rice with plums and cherry topping?")
Q1  = [[  168   141  8303    34  6861    72  1315  4536    34 15555    16     1
      1     1     1     1]]
Q2 = [[  168   141   927  7612   121     0     9 19275     0    16     1     1
      1     1     1     1]]
Similarity : 0.67
They are the same question: False

Siamese networks are important and useful. Many questions have already been asked on Quora or other platforms, and you can use Siamese networks to detect duplicates and avoid asking the same question twice.

Siamese Networks: Evaluating the Model

Evaluating the Siamese Network

Force CPU Use

For some reason the model eats up more and more memory on the GPU until it runs out - it seems like a memory leak. And for reasons I don't know, the way that TensorFlow tells you to disable the GPU (shown in the second code block below) doesn't work, so to get this running I essentially have to break the CUDA settings.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

This is the way they tell you to do it.

import tensorflow
tensorflow.config.set_visible_devices([], "GPU")

Imports

# python
from collections import namedtuple
from pathlib import Path

# pypi
import numpy
import trax

# this project
from neurotic.nlp.siamese_networks import (
    DataGenerator,
    DataLoader,
    SiameseModel,
 )

# other
from graeae import Timer

Set Up

The Data

loader = DataLoader()
data = loader.data

vocabulary_length = len(loader.vocabulary)
y_test = data.y_test
testing = data.test

del(loader)
del(data)

The Timer

TIMER = Timer()

The Model

siamese = SiameseModel(vocabulary_length)
path = Path("~/models/siamese_networks/model.pkl.gz").expanduser()
weights = siamese.model.init_from_file(path, weights_only=True)

Classify

To determine the accuracy of the model, we will use the test set that was configured earlier. While training used only positive examples, the test data (Q1_test, Q2_test, and y_test) is set up as pairs of questions, some of which are duplicates and some of which are not.

This routine will run all the test question pairs through the model, compute the cosine similarity of each pair, threshold it, and compare the result to y_test - the correct labels from the data set. The results are accumulated to produce an accuracy.

Instructions

  • Loop through the incoming data in batch_size chunks
  • Use the data generator to load q1, q2 a batch at a time. Don't forget to set shuffle=False!
  • Take the matching batch_size chunk of labels from y_test
  • Compute v1, v2 using the model
  • For each element of the batch
    • compute the cosine similarity of each pair of entries, v1[j], v2[j]
    • determine if the similarity is greater than the threshold
    • increment the accuracy if that result matches the expected result (y_test[j])
  • Compute the final accuracy and return it
Outcome = namedtuple("Outcome", ["accuracy", "true_positive",
                                 "true_negative", "false_positive",
                                 "false_negative"])

def classify(data_generator: iter,
             y: numpy.ndarray,
             threshold: float,
             model: trax.layers.Parallel):
    """Function to test the accuracy of the model.

    Args:
      data_generator: batch generator
      y: array of actual labels
      threshold: minimum cosine similarity to consider two questions the same
      model: The Siamese model.
    Returns:
       Outcome: namedtuple with the accuracy and true/false positive/negative counts.
    """
    accuracy = 0
    true_positive = false_positive = true_negative = false_negative = 0
    batch_start = 0

    for batch_one, batch_two in data_generator:
        batch_size = len(batch_one)
        batch_stop = batch_start + batch_size

        if batch_stop >= len(y):
            break
        batch_labels = y[batch_start: batch_stop]
        vector_one, vector_two = model((batch_one, batch_two))
        batch_start = batch_stop

        for row in range(batch_size):
            similarity = numpy.dot(vector_one[row], vector_two[row].T)
            same_question = int(similarity > threshold)
            correct = same_question == batch_labels[row]
            if same_question:
                if correct:
                    true_positive += 1
                else:
                    false_positive += 1
            else:
                if correct:
                    true_negative += 1
                else:
                    false_negative += 1
            accuracy += int(correct)
    return Outcome(accuracy=accuracy/len(y),
                   true_positive = true_positive,
                   true_negative = true_negative,
                   false_positive = false_positive,
                   false_negative = false_negative)
batch_size = 512
data_generator = DataGenerator(testing.question_one, testing.question_two,
                               batch_size=batch_size,
                               shuffle=False)

with TIMER:
    outcome = classify(
        data_generator=data_generator,
        y=y_test,
        threshold=0.7,
        model=siamese.model
    ) 
print(f"Outcome: {outcome}")
Started: 2021-02-10 21:42:27.320674
Ended: 2021-02-10 21:47:57.411380
Elapsed: 0:05:30.090706
Outcome: Outcome(accuracy=0.6546453536874203, true_positive=16439, true_negative=51832, false_positive=14425, false_negative=21240)

So, is that good or not? It might be more useful to look at the rates.

print(f"Accuracy: {outcome.accuracy:0.2f}")
true_positive = outcome.true_positive
false_negative = outcome.false_negative
true_negative = outcome.true_negative
false_positive = outcome.false_positive

print(f"True Positive Rate: {true_positive/(true_positive + false_negative): 0.2f}")
print(f"True Negative Rate: {true_negative/(true_negative + false_positive):0.2f}")
print(f"Precision: {outcome.true_positive/(true_positive + false_positive):0.2f}")
print(f"False Negative Rate: {false_negative/(false_negative + true_positive):0.2f}")
print(f"False Positive Rate: {false_positive/(false_positive + true_negative): 0.2f}")
Accuracy: 0.65
True Positive Rate:  0.44
True Negative Rate: 0.78
Precision: 0.53
False Negative Rate: 0.56
False Positive Rate:  0.22

So, it was better at recognizing questions that were different. We could probably fiddle with the threshold to make it more one way or the other, if we needed to.
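If we did want to fiddle with the threshold, a sweep like the sketch below would show the trade-off. Note that the similarities and labels arrays here are hypothetical - classify would need to be modified to return the raw cosine similarities along with the counts:

def sweep_thresholds(similarities: numpy.ndarray, labels: numpy.ndarray,
                     thresholds) -> None:
    """Print the accuracy for each candidate threshold."""
    for threshold in thresholds:
        predictions = (similarities > threshold).astype(int)
        accuracy = (predictions == labels).mean()
        print(f"threshold={threshold:0.2f}  accuracy={accuracy:0.3f}")

# similarities and labels are hypothetical outputs of a modified classify()
# sweep_thresholds(similarities, labels, numpy.linspace(0.5, 0.9, 9))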

Siamese Networks: Training the Model

Beginning

Now we are going to train the Siamese network model. As usual, we have to define the cost function and the optimizer, and feed in the built model. Before going into the training we will set up the data, defining the inputs using the data generator we built earlier.

Imports

# python
from collections import namedtuple
from functools import partial
from pathlib import Path
from tempfile import TemporaryFile

import sys

# pypi
from holoviews import opts

import holoviews
import hvplot.pandas
import jax
import numpy
import pandas
import trax

# this project
from neurotic.nlp.siamese_networks import (
    DataGenerator,
    DataLoader,
    SiameseModel,
    TOKENS,
    triplet_loss_layer,
)

from graeae import Timer, EmbedHoloviews

Set Up

The Timer And Plotting

TIMER = Timer()
slug = "siamese-networks-training-the-model"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{slug}")

Plot = namedtuple("Plot", ["width", "height", "fontscale", "tan", "blue", "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

The Data

loader = DataLoader()

data = loader.data

The Data generator

batch_size = 256
train_generator = DataGenerator(data.train.question_one, data.train.question_two,
                                batch_size=batch_size)
validation_generator = DataGenerator(data.validate.question_one,
                                     data.validate.question_two,
                                     batch_size=batch_size)
print(f"training question 1 rows: {len(data.train.question_one):,}")
print(f"validation question 1 rows: {len(data.validate.question_one):,}")
training question 1 rows: 89,179
validation question 1 rows: 22,295

Middle

Training the Model

We will now write a function that takes in the model and trains it. To train the model we have to decide how many times to iterate over the entire data set; each iteration is defined as an epoch. For each epoch, you have to go over all the data, using the training iterator.

  • Create the TrainTask and EvalTask
  • Create the training loop with trax.supervised.training.Loop
  • Pass in the following, depending on the context (train_task or eval_task):
    • labeled_data=generator
    • metrics=[TripletLoss()]
    • loss_layer=TripletLoss()
    • optimizer=trax.optimizers.Adam with a learning rate of 0.01
    • lr_schedule=lr_schedule
    • output_dir=output_dir

We will be using the triplet loss function with Adam optimizer. Please read the trax Adam documentation to get a full understanding.

This function should return a training.Loop object. To read more about this check the training.Loop documentation.

lr_schedule = trax.lr.warmup_and_rsqrt_decay(400, 0.01)
def train_model(Siamese, TripletLoss, lr_schedule,
                train_generator=train_generator,
                val_generator=validation_generator,
                output_dir="~/models/siamese_networks/",
                steps_per_checkpoint=100):
    """Training the Siamese Model

    Args:
       Siamese (function): Function that returns the Siamese model.
       TripletLoss (function): Function that defines the TripletLoss loss function.
       lr_schedule (function): Trax multifactor schedule function.
       train_generator (generator, optional): Training generator. Defaults to train_generator.
       val_generator (generator, optional): Validation generator. Defaults to validation_generator.
       output_dir (str, optional): Path to save the model to. Defaults to "~/models/siamese_networks/".
       steps_per_checkpoint (int, optional): Steps between checkpoints. Defaults to 100.

    Returns:
       trax.supervised.training.Loop: Training loop for the model.
    """
    output_dir = Path(output_dir).expanduser()

    train_task = trax.supervised.training.TrainTask(
        labeled_data=train_generator,       # Use generator (train)
        loss_layer=TripletLoss(),         # Use triplet loss. Don't forget to instantiate this object
        optimizer=trax.optimizers.Adam(0.01),          # Don't forget to add the learning rate parameter
        lr_schedule=lr_schedule, # Use Trax multifactor schedule function
        n_steps_per_checkpoint=steps_per_checkpoint,
    )

    eval_task = trax.supervised.training.EvalTask(
        labeled_data=val_generator,       # Use generator (val)
        metrics=[TripletLoss()],          # Use triplet loss. Don't forget to instantiate this object
    )

    training_loop = trax.supervised.training.Loop(Siamese,
                                                  [train_task],
                                                  eval_tasks=[eval_task],
                                                  output_dir=output_dir)

    return training_loop

Training

Trial Two

Note: I re-ran this next code block so it's actually the second run.

train_steps = 2000
siamese = SiameseModel(len(loader.vocabulary))
training_loop = train_model(siamese.model, triplet_loss_layer, lr_schedule, steps_per_checkpoint=5)

real_stdout = sys.stdout

TIMER.emit = False
TIMER.start()
with TemporaryFile("w") as temp_file:
    sys.stdout = temp_file
    training_loop.run(train_steps)
TIMER.stop()
sys.stdout = real_stdout
print(f"{TIMER.ended - TIMER.started}")
0:19:46.056057
for mode in training_loop.history.modes:
    print(mode)
    print(training_loop.history.metrics_for_mode(mode))
eval
['metrics/TripletLoss']
train
['metrics/TripletLoss', 'training/gradients_l2', 'training/learning_rate', 'training/loss', 'training/steps per second', 'training/weights_l2']
  • Plotting the Metrics

    Note: As of February 2021, the version of trax on pypi doesn't have a history attribute - to get it you have to install the code from the github repository.

    frame = pandas.DataFrame(training_loop.history.get("eval", "metrics/TripletLoss"), columns="Batch TripletLoss".split())
    
    minimum = frame.loc[frame.TripletLoss.idxmin()]
    vline = holoviews.VLine(minimum.Batch).opts(opts.VLine(color=PLOT.red))
    hline = holoviews.HLine(minimum.TripletLoss).opts(opts.HLine(color=PLOT.red))
    line = frame.hvplot(x="Batch", y="TripletLoss").opts(opts.Curve(color=PLOT.blue))
    
    plot = (line * hline * vline).opts(
        width=PLOT.width, height=PLOT.height,
        title="Evaluation Batch Triplet Loss",
                                       )
    output = Embed(plot=plot, file_name="evaluation_triplet_loss")()
    
    print(output)
    

    Figure Missing

    It looks like the loss is stabilizing. If it doesn't perform well I'll re-train it.

Trial Three

Let's see if the loss continues going down.

train_steps = 2000
siamese = SiameseModel(len(loader.vocabulary))
training_loop = train_model(siamese.model, triplet_loss_layer, lr_schedule, steps_per_checkpoint=5)

real_stdout = sys.stdout

TIMER.emit = False
TIMER.start()
with TemporaryFile("w") as temp_file:
    sys.stdout = temp_file
    training_loop.run(train_steps)
TIMER.stop()
sys.stdout = real_stdout
print(f"{TIMER.ended - TIMER.started}")
0:17:41.167719
  • Plotting the Metrics
    frame = pandas.DataFrame(
        training_loop.history.get("eval", "metrics/TripletLoss"),
        columns="Batch TripletLoss".split())
    
    minimum = frame.loc[frame.TripletLoss.idxmin()]
    vline = holoviews.VLine(minimum.Batch).opts(opts.VLine(color=PLOT.red))
    hline = holoviews.HLine(minimum.TripletLoss).opts(opts.HLine(color=PLOT.red))
    line = frame.hvplot(x="Batch", y="TripletLoss").opts(opts.Curve(color=PLOT.blue))
    
    plot = (line * hline * vline).opts(
        width=PLOT.width, height=PLOT.height,
        title="Evaluation Batch Triplet Loss (Third Run)",
                                       )
    output = Embed(plot=plot, file_name="evaluation_triplet_loss_third")()
    
    print(output)
    

    Figure Missing

    It looks like it stopped improving. Probably time to stop.

Siamese Networks: Hard Negative Mining

Hard Negative Mining

Now we will implement the TripletLoss. The loss is composed of two terms: one term uses the mean of all the non-duplicates, the second uses the closest negative. Our loss expression is then:

\begin{align}
\mathcal{Loss}_1(A,P,N) &= \max\left(-\cos(A,P) + \mathrm{mean}_{neg} + \alpha, 0\right)\\
\mathcal{Loss}_2(A,P,N) &= \max\left(-\cos(A,P) + \mathrm{closest}_{neg} + \alpha, 0\right)\\
\mathcal{Loss}(A,P,N) &= \mathrm{mean}(\mathcal{Loss}_1 + \mathcal{Loss}_2)\\
\end{align}

Here is a list of things we have to do:

  • As this will be run inside trax, use fastnp.xyz when using any xyz numpy function
  • Use fastnp.dot to calculate the similarity matrix \(v_1v_2^T\) of dimension batch_size x batch_size
  • Take the score of the duplicates on the diagonal fastnp.diagonal
  • Use the trax functions fastnp.eye and fastnp.maximum for the identity matrix and the maximum.

Imports

# python
from functools import partial

# pypi
from trax.fastmath import numpy as fastnp
from trax import layers

import jax
import numpy

Implementation

More Detailed Instructions

We'll describe the algorithm using a detailed example. Below, V1, V2 are the output of the normalization blocks in our model. Here we will use a batch_size of 4 and a d_model of 3. The inputs, Q1, Q2 are arranged so that corresponding inputs are duplicates while non-corresponding entries are not. The outputs will have the same pattern.

This test case arranges the outputs, v1 and v2, to highlight different scenarios. The first outputs, V1[0] and V2[0], match exactly, so the model is generating the same vector for the Q1[0] and Q2[0] inputs. The second outputs differ: V2[1] is set to match V2[2], simulating a model that is generating very poor results. V1[3] and V2[3] match exactly again, while V1[4] and V2[4] are set to be exactly wrong - 180 degrees from each other.

Cosine Similarity

The first step is to compute the cosine similarity matrix or score in the code. This is \(V_1 V_2^T\) which is generated with fastnp.dot.

The clever arrangement of inputs creates the data needed for positive and negative examples without having to run all pair-wise combinations. Because Q1[n] is a duplicate of only Q2[n], the other combinations are explicitly created negative examples, or Hard Negative examples. The matrix multiplication efficiently produces the cosine similarity of all positive/negative combinations. 'Positive' entries (on the diagonal) are the results of the duplicate examples and 'negative' entries (off the diagonal) are the results of the explicitly created negative examples. The results for our test case are as expected: V1[0] and V2[0] match, producing a '1', while our other 'positive' cases don't match well, as was arranged. V2[2] was set to match V1[3], producing a poor match at score[2,2] and an undesired 'negative' case of '1'.

With the similarity matrix (score) we can begin to implement the loss equations. First, we can extract \(\cos(A,P)\) by using fastnp.diagonal - the goal is to grab all the diagonal (duplicate) entries. This is positive in the code.

Closest Negative

Next, we will create the closest_negative. This is the non-duplicate entry in V2 that is closest to (has the largest cosine similarity with) an entry in V1. Each row, n, of score represents all comparisons of the results of Q1[n] vs Q2[x] within a batch. A specific example in our test case is row score[2,:], which has the cosine similarity of V1[2] and V2[x]. The closest_negative, as was arranged, is V2[2], which has a score of 1. This is the maximum value of the 'negative' (off-diagonal) entries in that row.

To implement this, we need to pick the maximum entry in each row of score while ignoring the 'positive' (diagonal) entries. To avoid selecting them, we can make them large negative numbers: multiply fastnp.eye(batch_size) by 2.0 and subtract it from scores. The result is negative_without_positive. Now we can use fastnp.max, row by row (axis=1), to select the maximum, which is closest_negative.

Mean Negative

Next, we'll create mean_negative. As the name suggests, this is the mean of all the 'negative' (off-diagonal) values in score on a row-by-row basis. We can use fastnp.eye(batch_size) and a constant, this time to create a mask with zeros on the diagonal. Element-wise multiply this with score to get just the 'negative' values - this is negative_zero_on_duplicate in the code. Compute the mean by using fastnp.sum on negative_zero_on_duplicate for axis=1 and dividing it by (batch_size - 1). This is mean_negative.

Now, we can compute loss using the two equations above and fastnp.maximum. This will form triplet_loss1 and triplet_loss2.

triple_loss is the fastnp.mean of the sum of the two individual losses.

def TripletLossFn(v1: numpy.ndarray, v2: numpy.ndarray,
                  margin: float=0.25) -> jax.interpreters.xla.DeviceArray:
    """Custom Loss function.

    Args:
       v1 (numpy.ndarray): Array with dimension (batch_size, model_dimension) associated to Q1.
       v2 (numpy.ndarray): Array with dimension (batch_size, model_dimension) associated to Q2.
       margin (float, optional): Desired margin. Defaults to 0.25.

    Returns:
       jax.interpreters.xla.DeviceArray: Triplet Loss.
    """
    # use fastnp to take the dot product of the two batches (don't forget to transpose the second argument)
    scores = fastnp.dot(v1, v2.T)
    # calculate new batch size
    batch_size = len(scores)
    # grab all the positive (diagonal) entries in scores - the duplicates
    positive = fastnp.diagonal(scores)
    # multiply fastnp.eye(batch_size) by 2.0 and subtract it from scores
    # so the positive entries can't be selected as the closest negative
    negative_without_positive = scores - (fastnp.eye(batch_size) * 2.0)
    # take the row-by-row max of negative_without_positive
    closest_negative = fastnp.max(negative_without_positive, axis=1)
    # zero out the diagonal (duplicates), keeping only the negative scores
    negative_zero_on_duplicate = (1.0 - fastnp.eye(batch_size)) * scores
    # sum each row and divide by (batch_size - 1) to get the mean negative
    mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1)/(batch_size - 1)
    # Loss 1: margin - cos(A, P) + closest negative, clipped at zero
    triplet_loss1 = fastnp.maximum(0, margin - positive + closest_negative)
    # Loss 2: margin - cos(A, P) + mean negative, clipped at zero
    triplet_loss2 = fastnp.maximum(0, (margin - positive) + mean_negative)
    # add the two losses together and take the fastnp.mean
    triplet_loss = fastnp.mean(triplet_loss1 + triplet_loss2)
    return triplet_loss
v1 = numpy.array([[0.26726124, 0.53452248, 0.80178373],[0.5178918 , 0.57543534, 0.63297887]])
v2 = numpy.array([[ 0.26726124,  0.53452248,  0.80178373],[-0.5178918 , -0.57543534, -0.63297887]])
triplet_loss = TripletLossFn(v2, v1)
print(f"Triplet Loss: {triplet_loss}")

assert triplet_loss == 0.5
Triplet Loss: 0.5
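
To see where the 0.5 comes from, here is the arithmetic for this two-row test case worked by hand (values rounded):

# scores = fastnp.dot(v2, v1.T) ≈ [[ 1.0000,  0.9535],
#                                  [-0.9535, -1.0000]]
# positive (the diagonal)                    = [ 1.0000, -1.0000]
# closest_negative = mean_negative           = [ 0.9535, -0.9535]  (one off-diagonal entry per row)
# triplet_loss1 = max(0, 0.25 - positive + closest_negative) = [0.2035, 0.2965]
# triplet_loss2 = max(0, 0.25 - positive + mean_negative)    = [0.2035, 0.2965]
# mean(triplet_loss1 + triplet_loss2) = mean([0.4070, 0.5930]) = 0.5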

To make a layer out of a function with no trainable variables, use tl.Fn.

from functools import partial
def TripletLoss(margin=0.25):
    triplet_loss_fn = partial(TripletLossFn, margin=margin)
    return layers.Fn('TripletLoss', triplet_loss_fn)

Bundle It Up

Unfortunately trax does some kind of weirdness where it counts the arguments of the things you use as layers, so class-based callables won't work (it counts the self argument, giving the layer one more argument than it expects). There might be a way to work around this, but it doesn't appear to be documented, so this has to be done with plain functions. That's not bad, just unexpected.

Imports

# python
from functools import partial

# from pypi
from trax.fastmath import numpy as fastmath_numpy
from trax import layers

import attr
import jax
import numpy
import trax

Triplet Loss

def triplet_loss(v1: numpy.ndarray, v2: numpy.ndarray,
                 margin: float=0.25) -> jax.interpreters.xla.DeviceArray:
    """Calculates the triplet loss

    Args:
     v1: normalized batch for question 1
     v2: normalized batch for question 2
     margin: how far apart to push the duplicates and non-duplicates

    Returns:
     triplet loss
    """
    scores = fastmath_numpy.dot(v1, v2.T)
    batch_size = len(scores)
    positive = fastmath_numpy.diagonal(scores)
    negative_without_positive = scores - (fastmath_numpy.eye(batch_size) * 2.0)
    closest_negative = fastmath_numpy.max(negative_without_positive, axis=1)
    negative_zero_on_duplicate = (1.0 - fastmath_numpy.eye(batch_size)) * scores
    mean_negative = fastmath_numpy.sum(negative_zero_on_duplicate, axis=1)/(batch_size - 1)
    triplet_loss1 = fastmath_numpy.maximum(0, margin - positive + closest_negative)
    triplet_loss2 = fastmath_numpy.maximum(0, (margin - positive) + mean_negative)
    return fastmath_numpy.mean(triplet_loss1 + triplet_loss2)

Triplet Loss Layer

Another not-well-documented limitation is that the function you create the layer from isn't allowed to have default argument values, so if we want the margin to have a default we have to use partial to set its value before creating the layer…

def triplet_loss_layer(margin: float=0.25) -> layers.Fn:
    """Converts the triplet_loss function to a trax layer"""
    with_margin = partial(triplet_loss, margin=margin)
    return layers.Fn("TripletLoss", with_margin)

Check It Out

from neurotic.nlp.siamese_networks import triplet_loss_layer

layer = triplet_loss_layer()
print(type(layer))
<class 'trax.layers.base.PureLayer'>

Siamese Networks: Defining the Model

Understanding the Siamese Network

A Siamese network is a neural network which uses the same weights while working in tandem on two different input vectors to compute comparable output vectors.

You get the question embedding, run it through an LSTM layer, normalize \(v_1\) and \(v_2\), and finally use a triplet loss (explained below) to get the corresponding cosine similarity for each pair of questions. As usual, you will start by importing the data set. The triplet loss makes use of a baseline (anchor) input that is compared to a positive (truthy) input and a negative (falsy) input. The distance from the anchor to the positive input is minimized, and the distance from the anchor to the negative input is maximized. In math terms, you are trying to minimize the following loss.

\[ \mathcal{L}(A, P, N)=\max \left(\|\mathrm{f}(A)-\mathrm{f}(P)\|^{2}-\|\mathrm{f}(A)-\mathrm{f}(N)\|^{2}+\alpha, 0\right) \]

\(A\) is the anchor input, for example \(q1_1\), \(P\) the duplicate input, for example \(q2_1\), and \(N\) the negative input (the non-duplicate question), for example \(q2_2\). \(\alpha\) is a margin; you can think of it as a safety net, or how far you want to push the duplicates from the non-duplicates.
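
Here's a small numeric sketch of that formula, just to make the arithmetic concrete - the vectors and the \(\alpha = 0.25\) value are made up purely for illustration:

import numpy

def euclidean_triplet_loss(anchor: numpy.ndarray, positive: numpy.ndarray,
                           negative: numpy.ndarray, alpha: float=0.25) -> float:
    """The margin-based triplet loss from the equation above."""
    positive_distance = numpy.sum((anchor - positive)**2)
    negative_distance = numpy.sum((anchor - negative)**2)
    return max(positive_distance - negative_distance + alpha, 0)

anchor = numpy.array([1.0, 0.0])
positive = numpy.array([0.9, 0.1])   # close to the anchor (a duplicate)
negative = numpy.array([-1.0, 0.0])  # far from the anchor (a non-duplicate)
print(euclidean_triplet_loss(anchor, positive, negative))  # 0 - already separated by more than alpha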

Imports

# from pypi
import trax.fastmath.numpy as fastnp
import trax.layers as tl
# This Project
from neurotic.nlp.siamese_networks import DataLoader, TOKENS

Set Up

loader = DataLoader()

data = loader.data

Implementation

To implement this model, you will be using `trax`. Concretely, you will be using the following functions.

  • tl.Serial: Combinator that applies layers serially (by function composition) and allows you to set up the overall structure of the feedforward network. docs / source code
    • You can pass in the layers as arguments to Serial, separated by commas.
    • For example: tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))
  • tl.Embedding: Maps discrete tokens to vectors. It will have shape (vocabulary length X dimension of output vectors). The dimension of output vectors (also called d_feature) is the number of elements in the word embedding. docs / source code
    • tl.Embedding(vocab_size, d_feature).
    • vocab_size is the number of unique words in the given vocabulary.
    • d_feature is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
  • tl.LSTM The LSTM layer. It leverages another Trax layer called LSTMCell. The number of units should be specified and should match the number of elements in the word embedding. docs / source code
    • tl.LSTM(n_units) Builds an LSTM layer of n_units.
  • tl.Mean: Computes the mean across a desired axis. Mean uses one tensor axis to form groups of values and replaces each group with the mean value of that group. docs / source code
    • tl.Mean(axis=1) mean over columns.
  • tl.Fn: Layer with no weights that applies the function f, which can be specified using lambda syntax. docs / source code
    • Here f normalizes the vectors so that the dot product of the outputs gives the cosine similarity.
    • tl.Fn('Normalize', lambda x: normalize(x)) returns a layer with no weights that applies the function f.
  • tl.Parallel: A combinator layer (like Serial) that applies a list of layers in parallel to its inputs. docs / source code
def Siamese(vocab_size=len(loader.vocabulary), d_model=128, mode='train'):
    """Returns a Siamese model.

    Args:
       vocab_size (int, optional): Length of the vocabulary. Defaults to len(vocab).
       d_model (int, optional): Depth of the model. Defaults to 128.
       mode (str, optional): 'train', 'eval' or 'predict', predict mode is for fast inference. Defaults to 'train'.

    Returns:
       trax.layers.combinators.Parallel: A Siamese model. 
    """

    def normalize(x):  # normalizes the vectors to have L2 norm 1
        return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))

    q_processor = tl.Serial(  # Processor will run on Q1 and Q2.
        tl.Embedding(vocab_size, d_model), # Embedding layer
        tl.LSTM(d_model), # LSTM layer
        tl.Mean(axis=1), # Mean over columns
        tl.Fn("Normalize", normalize)  # Apply normalize function
    )  # Returns one vector of shape [batch_size, d_model].

    # Run on Q1 and Q2 in parallel.
    model = tl.Parallel(q_processor, q_processor)
    return model

Check the Model

model = Siamese()
print(model)
Parallel_in2_out2[
  Serial[
    Embedding_77068_128
    LSTM_128
    Mean
    Normalize
  ]
  Serial[
    Embedding_77068_128
    LSTM_128
    Mean
    Normalize
  ]
]

Bundle It Up

<<imports>>

<<constants>>

<<normalize>>


<<siamese-network>>

    <<the-processor>>

    <<the-model>>

Imports

# python
from collections import namedtuple

# pypi
from trax import layers
from trax.fastmath import numpy as fastmath_numpy

import attr
import numpy
import trax

Constants

Axis = namedtuple("Axis", ["columns", "last"])
Constants = namedtuple("Constants", ["model_depth", "axis"])

AXIS = Axis(1, -1)

CONSTANTS = Constants(128, AXIS)

Normalize

def normalize(x: numpy.ndarray) -> numpy.ndarray:
    """Normalizes the vectors to have L2 norm 1

    Args:
     x: the array of vectors to normalize

    Returns:
     normalized version of x
    """
    return x/fastmath_numpy.sqrt(fastmath_numpy.sum(x**2,
                                                    axis=CONSTANTS.axis.last,
                                                    keepdims=True))

The Siamese Model

@attr.s(auto_attribs=True)
class SiameseModel:
    """The Siamese network model

    Args:
     vocabulary_size: number of tokens in the vocabulary
     model_depth: depth of our embedding layer
     mode: train|eval|predict
    """
    vocabulary_size: int
    model_depth: int=CONSTANTS.model_depth
    mode: str="train"
    _processor: trax.layers.combinators.Serial=None
    _model: trax.layers.combinators.Parallel=None

The Processor

@property
def processor(self) -> trax.layers.Serial:
    """The Question Processor"""
    if self._processor is None:
        self._processor = layers.Serial(
            layers.Embedding(self.vocabulary_size, self.model_depth),
            layers.LSTM(self.model_depth),
            layers.Mean(axis=CONSTANTS.axis.columns),
            layers.Fn("Normalize", normalize) 
        ) 
    return self._processor

The Model

@property
def model(self) -> trax.layers.Parallel:
    """The Siamese Model"""
    if self._model is None:
        processor = layers.Serial(
            layers.Embedding(self.vocabulary_size, self.model_depth),
            layers.LSTM(self.model_depth),
            layers.Mean(axis=CONSTANTS.axis.columns),
            layers.Fn("Normalize", normalize) 
        ) 

        self._model = layers.Parallel(processor, processor)
    return self._model

Check It Out

from neurotic.nlp.siamese_networks import SiameseModel

model = SiameseModel(len(loader.vocabulary))
print(model.model)
Parallel_in4_out2[
  Serial_in2[
    Embedding_77068_128
    LSTM_128
    Mean
    Normalize_in2
  ]
  Serial_in2[
    Embedding_77068_128
    LSTM_128
    Mean
    Normalize_in2
  ]
]

Siamese Networks: The Data Generator

Beginning

Most of the time in Natural Language Processing, and AI in general, we use batches when training our data sets. If you were to use stochastic gradient descent with one example at a time, it would take forever to build a model. In this example, we show how you can build a data generator that takes in \(Q1\) and \(Q2\) and returns a batch of size batch_size in the form \(([q1_1, q1_2, q1_3, \ldots], [q2_1, q2_2, q2_3, \ldots])\). The tuple consists of two arrays and each array has batch_size questions. Again, \(q1_i\) and \(q2_i\) are duplicates, but they are not duplicates of any other elements in the batch.

The iterator that we're going to create returns a pair of arrays of questions.

We'll implement the data generator below. Here are some things we will need.

  • A while True loop.
  • If idx >= len_q, reset idx to \(0\).
  • The generator should return shuffled batches of data. To achieve this without modifying the actual question lists, a list containing the indexes of the questions is created. This list can be shuffled and used to get random batches every time the index is reset.
  • Append elements of \(Q1\) and \(Q2\) to input1 and input2 respectively.
  • If len(input1) == batch_size, determine max_len as the length of the longest question in input1 and input2. Ceil max_len to a power of \(2\) (for computation purposes) using: max_len = 2**int(np.ceil(np.log2(max_len))).
  • Pad every question with vocab['<PAD>'] until it reaches length max_len.
  • Use yield to return input1, input2.
  • Don't forget to reset input1, input2 to empty lists at the end (the generator resumes from where it last left off).

Imports

# python
import random

# pypi
import numpy

# this project
from neurotic.nlp.siamese_networks import DataLoader, TOKENS

Set Up

Our Data

loader = DataLoader()

data = loader.data

The Idiotic Names

np = numpy
rnd = random

Middle

def data_generator(Q1:list, Q2:list, batch_size: int,
                   pad: int=1, shuffle: bool=True):
    """Generator function that yields batches of data

    Args:
       Q1 (list): List of transformed (to tensor) questions.
       Q2 (list): List of transformed (to tensor) questions.
       batch_size (int): Number of elements per batch.
       pad (int, optional): Pad character from the vocab. Defaults to 1.
       shuffle (bool, optional): If the batches should be randomnized or not. Defaults to True.

    Yields:
       tuple: Of the form (input1, input2) with types (numpy.ndarray, numpy.ndarray)
       NOTE: input1: inputs to your model [q1a, q2a, q3a, ...] i.e. (q1a,q1b) are duplicates
             input2: targets to your model [q1b, q2b,q3b, ...] i.e. (q1a,q2i) i!=a are not duplicates
    """

    input1 = []
    input2 = []
    idx = 0
    len_q = len(Q1)
    question_indexes = list(range(len_q))

    if shuffle:
        rnd.shuffle(question_indexes)
    while True:
        if idx >= len_q:
            # if idx is greater than or equal to len_q, set idx accordingly 
            # (Hint: look at the instructions above)
            idx = 0
            # shuffle to get random batches if shuffle is set to True
            if shuffle:
                rnd.shuffle(question_indexes)

        # get questions at the `question_indexes[idx]` position in Q1 and Q2
        q1 = Q1[question_indexes[idx]]
        q2 = Q2[question_indexes[idx]]

        # increment idx by 1
        idx += 1
        # append q1
        input1.append(q1)
        # append q2
        input2.append(q2)
        if len(input1) == batch_size:
            # determine max_len as the longest question in input1 & input 2
            # Hint: use the `max` function. 
            # take max of input1 & input2 and then max out of the two of them.
            max_len = max(max(len(question) for question in input1),
                          max(len(question) for question in input2))
            print(max_len)
            # pad to power-of-2 (Hint: look at the instructions above)
            max_len = 2**int(np.ceil(np.log2(max_len)))
            print(max_len)
            b1 = []
            b2 = []
            for q1, q2 in zip(input1, input2):
                # add [pad] to q1 until it reaches max_len
                q1 = q1 + ((max_len - len(q1)) * [pad])
                # add [pad] to q2 until it reaches max_len
                q2 = q2 + ((max_len - len(q2)) * [pad])
                # append q1
                b1.append(q1)
                # append q2
                b2.append(q2)
            # use b1 and b2
            yield np.array(b1), np.array(b2)
            # reset the batches
            input1, input2 = [], []  # reset the batches

Try It Out

rnd.seed(34)
batch_size = 2
generator = data_generator(data.train.question_one, data.train.question_two, batch_size)
result_1, result_2 = next(generator)
print(f"First questions  : \n{result_1}\n")
print(f"Second questions : \n{result_2}")
11
16
First questions  : 
[[  34   37   13   50  536 1303 6428   25  924  157   28    1    1    1
     1    1]
 [  34   95  573 1444 2343   28    1    1    1    1    1    1    1    1
     1    1]]

Second questions : 
[[  34   37   13  575 1303 6428   25  924  157   28    1    1    1    1
     1    1]
 [   9  151   25  573 5642   28    1    1    1    1    1    1    1    1
     1    1]]
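
The two numbers printed before the batches (11 and 16) are the generator's debugging output: the longest question in this batch has 11 tokens, and the padded length is the next power of two.

max_len = 11
print(2**int(np.ceil(np.log2(max_len))))  # 16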

Bundling It Up

Imports

# python
from collections import namedtuple

import random

# pypi
import attr
import numpy

# this project
from neurotic.nlp.siamese_networks import DataLoader, TOKENS

The Data Generator

@attr.s(auto_attribs=True)
class DataGenerator:
    """Batch Generator for Quora question dataset

    Args:
     question_one: tensorized question 1
     question_two: tensorized question 2
     batch_size: size of generated batches
     padding: token to use to pad the lists
     shuffle: whether to shuffle the questions around
    """
    question_one: numpy.ndarray
    question_two: numpy.ndarray
    batch_size: int
    padding: int=TOKENS.padding
    shuffle: bool=True
    _batch: iter=None

The Generator Definition

def data_generator(self):
    """Generator function that yields batches of data

    Yields:
       tuple: (batch_question_1, batch_question_2)
    """
    unpadded_1 = []
    unpadded_2 = []
    index = 0
    number_of_questions = len(self.question_one)
    question_indexes = list(range(number_of_questions))

    if self.shuffle:
        random.shuffle(question_indexes)

    while True:
        if index >= number_of_questions:
            index = 0
            if self.shuffle:
                random.shuffle(question_indexes)

        unpadded_1.append(self.question_one[question_indexes[index]])
        unpadded_2.append(self.question_two[question_indexes[index]])

        index += 1

        if len(unpadded_1) == self.batch_size:
            max_len = max(max(len(question) for question in unpadded_1),
                          max(len(question) for question in unpadded_2))
            max_len = 2**int(numpy.ceil(numpy.log2(max_len)))
            padded_1 = []
            padded_2 = []
            for question_1, question_2 in zip(unpadded_1, unpadded_2):
                padded_1.append(question_1 + ((max_len - len(question_1)) * [self.padding]))
                padded_2.append(question_2 +  ((max_len - len(question_2)) * [self.padding]))
            yield numpy.array(padded_1), numpy.array(padded_2)
            unpadded_1, unpadded_2 = [], []
    return

The Generator

@property
def batch(self):
    """The generator instance"""
    if self._batch is None:
        self._batch = self.data_generator()
    return self._batch

The Iter Method

def __iter__(self):
    return self

The Next Method

def __next__(self):
    return next(self.batch)

Check It Out

from neurotic.nlp.siamese_networks import DataGenerator, DataLoader

loader = DataLoader()
data = loader.data
generator = DataGenerator(data.train.question_one, data.train.question_two, batch_size=2)
random.seed(34)
batch_size = 2
result_1, result_2 = next(generator)
print(f"First questions  : \n{result_1}\n")
print(f"Second questions : \n{result_2}")
First questions  : 
[[  34   37   13   50  536 1303 6428   25  924  157   28    1    1    1
     1    1]
 [  34   95  573 1444 2343   28    1    1    1    1    1    1    1    1
     1    1]]

Second questions : 
[[  34   37   13  575 1303 6428   25  924  157   28    1    1    1    1
     1    1]
 [   9  151   25  573 5642   28    1    1    1    1    1    1    1    1
     1    1]]