# Siamese Networks: New Questions

## Trying New Questions

### Imports

# python
from pathlib import Path

# pypi
import nltk
import numpy
import pandas
import trax

# this project
from neurotic.nlp.siamese_networks import (
    DataGenerator,
    DataLoader,
    SiameseModel,
    TOKENS,
)


### Set Up

#### The Data

loader = DataLoader()
vocabulary = loader.vocabulary
data_generator = DataGenerator


#### The Model

siamese = SiameseModel(len(vocabulary))
path = Path("~/models/siamese_networks/model.pkl.gz").expanduser()
weights = siamese.model.init_from_file(path, weights_only=True)
model = siamese.model


## Implementing It

Write a function =predict= that takes in two questions, the model, and the vocabulary, and returns whether the questions are duplicates (1) or not duplicates (0), given a similarity threshold.

Instructions:

• Tokenize your questions using nltk.word_tokenize
• Create Q1, Q2 by encoding your questions as lists of numbers using the vocabulary
• Use model() to create v1, v2
• Compute the cosine similarity d (the dot product) of v1, v2
• Compute the result by comparing d to the threshold
def predict(question1: str, question2: str,
            threshold: float=0.7, model: trax.layers.Parallel=model,
            vocab: dict=vocabulary, data_generator: type=data_generator,
            verbose: bool=True) -> bool:
    """Function for predicting if two questions are duplicates.

    Args:
        question1 (str): First question.
        question2 (str): Second question.
        threshold (float): Desired threshold.
        model (trax.layers.combinators.Parallel): The Siamese model.
        vocab (collections.defaultdict): The vocabulary used.
        data_generator (type): Data generator class. Defaults to data_generator.
        verbose (bool, optional): If the results should be printed out. Defaults to True.

    Returns:
        bool: True if the questions are duplicates, False otherwise.
    """
    question_one = [[vocab[word] for word in nltk.word_tokenize(question1)]]
    question_two = [[vocab[word] for word in nltk.word_tokenize(question2)]]

    questions = next(data_generator(question_one,
                                    question_two,
                                    batch_size=1))
    vector_1, vector_2 = model(questions)
    similarity = float(numpy.dot(vector_1, vector_2.T))
    same_question = similarity > threshold

    if verbose:
        print(f"Q1  = {questions[0]}")
        print(f"Q2 = {questions[1]}")
        print(f"Similarity : {similarity:0.2f}")
        print(f"They are the same question: {same_question}")
    return same_question


### Some Trials

print(TOKENS)

Tokens(unknown=0, padding=1, padding_token='<PAD>')


So if we see a 0 in the tokens then we know the word wasn't in the vocabulary.
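As a quick sketch of how that encoding behaves, here is the idea with a tiny, made-up vocabulary (the real one is a defaultdict over the whole Quora corpus; the words and ids below are purely for illustration):

```python
from collections import defaultdict

# hypothetical miniature vocabulary; unknown words fall through to 0
vocab = defaultdict(int, {"<PAD>": 1, "When": 2, "will": 3, "I": 4})

tokens = [vocab[word] for word in ["When", "will", "pigs", "fly"]]
print(tokens)
```

Since "pigs" and "fly" aren't in this vocabulary, they both encode to the unknown token, 0.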

question1 = "When will I see you?"
question2 = "When can I see you again?"
# 1 means it is duplicated, 0 otherwise
predict(question1, question2, 0.7, model, vocabulary, verbose=True)

Q1  = [[581  64  20  44  49  16   1   1]]
Q2 = [[ 581   39   20   44   49 7280   16    1]]
Similarity : 0.95
They are the same question: True

question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"

predict(question1 , question2, 0.7, model, vocabulary, verbose=True)

Q1  = [[  446  1138  3159  1169    70 29016    16     1]]
Q2 = [[  446  1138    57 15302    24    70  7430    16]]
Similarity : 0.60
They are the same question: False

predict("Do cows have butts?", "Do dogs have bones?")

Q1  = [[  446  5757   216 25442    16     1     1     1]]
Q2 = [[  446   788   216 11192    16     1     1     1]]
Similarity : 0.25
They are the same question: False

predict("Do cows from Lancashire have butts?", "Do dogs have bones as big as whales?")

Q1  = [[  446  5757   125     0   216 25442    16     1     1     1     1     1
1     1     1     1]]
Q2 = [[  446   788   216 11192   249  1124   249 30836    16     1     1     1
1     1     1     1]]
Similarity : 0.13
They are the same question: False

predict("Can pigs fly?", "Are you my mother?")

Q1  = [[  221 14137  5750    16     1     1     1     1]]
Q2 = [[ 517   49   41 1585   16    1    1    1]]
Similarity : 0.01
They are the same question: False

predict("Shall we dance?", "Shall I fart?")

Q1  = [[19382   138  4201    16]]
Q2 = [[19382    20 18288    16]]
Similarity : 0.71
They are the same question: True


Hm… surprising that "fart" was in the data set, and that it's considered the same as dancing.

farts = loader.training_data[loader.training_data.question2.str.contains("fart[^a-z]")]
print(len(farts))
print(farts.question2.head())

16
19820                                    Can penguins fart?
60745       How do I control a fart when I'm about to fart?
83124           What word square starts with the word fart?
96707         Which part of human body is called fart pump?
120727    Why do people fart more when they wake up in t...
Name: question2, dtype: object


Maybe I shouldn't have been surprised.

predict("Am I man or gorilla?", "Am I able to eat the pasta?")

Q1  = [[4311   20 1215   75 7438   16    1    1]]
Q2 = [[ 4311    20   461    37   922    70 14552    16]]
Similarity : 0.20
They are the same question: False


It looks like the model only looks at the first words… at least when the sentences are short.

predict("Will we return to Mars or go instead to Venus?", "Will we eat rice with plums and cherry topping?")

Q1  = [[  168   141  8303    34  6861    72  1315  4536    34 15555    16     1
1     1     1     1]]
Q2 = [[  168   141   927  7612   121     0     9 19275     0    16     1     1
1     1     1     1]]
Similarity : 0.67
They are the same question: False


Siamese networks are important and useful. Often a question has already been asked on Quora or another platform, and you can use Siamese networks to detect duplicate questions.

# Siamese Networks: Evaluating the Model

## Evaluating the Siamese Network

### Force CPU Use

For some reason the model eats up more and more memory on the GPU until it runs out. It seems like a memory leak. Anyway, for reasons that I don't know, the way that tensorflow tells you to disable the GPU doesn't work (it's in the second code block), so to get this to work I have to essentially break the CUDA settings.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""


This is the way they tell you to do it.

import tensorflow
tensorflow.config.set_visible_devices([], "GPU")


### Imports

# python
from collections import namedtuple
from pathlib import Path

# pypi
import numpy
import trax

# this project
from neurotic.nlp.siamese_networks import (
DataGenerator,
SiameseModel,
)

# other
from graeae import Timer


### Set Up

#### The Data

loader = DataLoader()
data = loader.data  # assuming the loader exposes the split data set

y_test = data.y_test
testing = data.test

del(data)


#### The Timer

TIMER = Timer()


#### The Model

siamese = SiameseModel(len(loader.vocabulary))
path = Path("~/models/siamese_networks/model.pkl.gz").expanduser()
weights = siamese.model.init_from_file(path, weights_only=True)


## Classify

To determine the accuracy of the model, we will use the test set that was configured earlier. While in training we used only positive examples, the test data (Q1_test, Q2_test, and y_test) is set up as pairs of questions, some of which are duplicates and some of which are not.

This routine will run all the test question pairs through the model, compute the cosine similarity of each pair, threshold it, and compare the result to y_test - the correct response from the data set. The results are accumulated to produce an accuracy.

Instructions

• Loop through the incoming data in batch_size chunks
• Use the data generator to load q1, q2 a batch at a time. Don't forget to set shuffle=False!
• Copy a batch_size chunk of y into y_test
• Compute v1, v2 using the model
• For each element of the batch:
• compute the cosine similarity of each pair of entries, v1[j], v2[j]
• determine if the similarity d > threshold
• increment the accuracy if that result matches the expected result (y_test[j])
• Compute the final accuracy and return it
Outcome = namedtuple("Outcome", ["accuracy", "true_positive",
                                 "true_negative", "false_positive",
                                 "false_negative"])

def classify(data_generator: iter,
             y: numpy.ndarray,
             threshold: float,
             model: trax.layers.Parallel) -> Outcome:
    """Function to test the accuracy of the model.

    Args:
        data_generator: batch generator
        y: array of actual targets
        threshold: minimum similarity to be considered the same question
        model: the Siamese model

    Returns:
        Outcome: accuracy of the model along with the confusion-matrix counts
    """
    accuracy = 0
    true_positive = false_positive = true_negative = false_negative = 0
    batch_start = 0

    for batch_one, batch_two in data_generator:
        batch_size = len(batch_one)
        batch_stop = batch_start + batch_size

        if batch_stop >= len(y):
            break
        batch_labels = y[batch_start: batch_stop]
        vector_one, vector_two = model((batch_one, batch_two))
        batch_start = batch_stop

        for row in range(batch_size):
            similarity = numpy.dot(vector_one[row], vector_two[row].T)
            same_question = int(similarity > threshold)
            correct = same_question == batch_labels[row]
            if same_question:
                if correct:
                    true_positive += 1
                else:
                    false_positive += 1
            else:
                if correct:
                    true_negative += 1
                else:
                    false_negative += 1
            accuracy += int(correct)
    return Outcome(accuracy=accuracy/len(y),
                   true_positive=true_positive,
                   true_negative=true_negative,
                   false_positive=false_positive,
                   false_negative=false_negative)

batch_size = 512
data_generator = DataGenerator(testing.question_one, testing.question_two,
                               batch_size=batch_size,
                               shuffle=False)

with TIMER:
    outcome = classify(
        data_generator=data_generator,
        y=y_test,
        threshold=0.7,
        model=siamese.model
    )
print(f"Outcome: {outcome}")

Started: 2021-02-10 21:42:27.320674
Ended: 2021-02-10 21:47:57.411380
Elapsed: 0:05:30.090706
Outcome: Outcome(accuracy=0.6546453536874203, true_positive=16439, true_negative=51832, false_positive=14425, false_negative=21240)


So, is that good or not? It might be more useful to look at the rates.

print(f"Accuracy: {outcome.accuracy:0.2f}")
true_positive = outcome.true_positive
false_negative = outcome.false_negative
true_negative = outcome.true_negative
false_positive = outcome.false_positive

print(f"True Positive Rate: {true_positive/(true_positive + false_negative): 0.2f}")
print(f"True Negative Rate: {true_negative/(true_negative + false_positive):0.2f}")
print(f"Precision: {outcome.true_positive/(true_positive + false_positive):0.2f}")
print(f"False Negative Rate: {false_negative/(false_negative + true_positive):0.2f}")
print(f"False Positive Rate: {false_positive/(false_positive + true_negative): 0.2f}")

Accuracy: 0.65
True Positive Rate:  0.44
True Negative Rate: 0.78
Precision: 0.53
False Negative Rate: 0.56
False Positive Rate:  0.22


So, it was better at recognizing questions that were different than questions that were duplicates. We could probably fiddle with the threshold to push it one way or the other, if we needed to.
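If we did want to fiddle with the threshold, a cheap way is to sweep a range of thresholds over cached similarity scores rather than re-running the model. This is only a sketch: the scores and labels below are made up for illustration, where the real ones would come from the classify loop above.

```python
import numpy

def accuracy_at(similarities: numpy.ndarray, labels: numpy.ndarray,
                threshold: float) -> float:
    """Accuracy of thresholding cached cosine similarities."""
    predictions = (similarities > threshold).astype(int)
    return float((predictions == labels).mean())

# hypothetical cached scores and ground-truth labels
similarities = numpy.array([0.95, 0.60, 0.25, 0.13, 0.71, 0.20])
labels = numpy.array([1, 0, 0, 0, 1, 0])

for threshold in (0.5, 0.6, 0.7, 0.8):
    print(f"{threshold:0.1f}: {accuracy_at(similarities, labels, threshold):0.2f}")
```

Caching the similarities once and sweeping thresholds makes the trade-off between the true positive and true negative rates cheap to explore.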

# Siamese Networks: Training the Model

## Beginning

Now we are going to train the Siamese network model. As usual, we have to define the cost function and the optimizer, and feed in the built model. Before going into the training, we will define the inputs using the data generator we built earlier.

### Imports

# python
from collections import namedtuple
from functools import partial
from pathlib import Path
from tempfile import TemporaryFile

import sys

# pypi
from holoviews import opts

import holoviews
import hvplot.pandas
import jax
import numpy
import pandas
import trax

# this project
from neurotic.nlp.siamese_networks import (
DataGenerator,
SiameseModel,
TOKENS,
triplet_loss_layer,
)

from graeae import Timer, EmbedHoloviews


### Set Up

#### The Timer And Plotting

TIMER = Timer()

slug = "siamese-networks-training-the-model"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{slug}")

Plot = namedtuple("Plot", ["width", "height", "fontscale", "tan", "blue", "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
)


#### The Data

loader = DataLoader()
data = loader.data  # assuming the loader exposes the split data set
siamese = SiameseModel(len(loader.vocabulary))



#### The Data generator

batch_size = 256
train_generator = DataGenerator(data.train.question_one, data.train.question_two,
                                batch_size=batch_size)
validation_generator = DataGenerator(data.validate.question_one,
                                     data.validate.question_two,
                                     batch_size=batch_size)
print(f"training question 1 rows: {len(data.train.question_one):,}")
print(f"validation question 1 rows: {len(data.validate.question_one):,}")

training question 1 rows: 89,179
validation question 1 rows: 22,295


## Middle

### Training the Model

We will now write a function that takes in the model and trains it. To train the model we have to decide how many times to iterate over the entire data set; each iteration is defined as an epoch. For each epoch, you have to go over all the data, using the training iterator.

• Create the TrainTask and EvalTask
• Create the training loop trax.supervised.training.Loop
• Pass in the following, depending on the context (train_task or eval_task):
• labeled_data=generator
• metrics=[TripletLoss()]
• loss_layer=TripletLoss()
• optimizer=trax.optimizers.Adam with a learning rate of 0.01
• lr_schedule=lr_schedule
• output_dir=output_dir

This function should return a training.Loop object. To read more about this check the training.Loop documentation.

lr_schedule = trax.lr.warmup_and_rsqrt_decay(400, 0.01)
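The shape of this schedule (linear warmup to the peak value, then reciprocal-square-root decay) can be sketched in plain Python. This is my approximation of the curve, not trax's exact code:

```python
import math

def warmup_then_rsqrt(step: int, n_warmup_steps: int=400,
                      max_value: float=0.01) -> float:
    """Approximate shape of trax.lr.warmup_and_rsqrt_decay(400, 0.01)."""
    if step <= n_warmup_steps:
        # linear warmup from 0 to max_value
        return max_value * step / n_warmup_steps
    # reciprocal square-root decay after the warmup ends
    return max_value * math.sqrt(n_warmup_steps / step)

for step in (100, 400, 1600):
    print(step, warmup_then_rsqrt(step))
```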

def train_model(Siamese, TripletLoss, lr_schedule,
                train_generator=train_generator,
                val_generator=validation_generator,
                output_dir="~/models/siamese_networks/",
                steps_per_checkpoint=100):
    """Training the Siamese Model

    Args:
        Siamese (trax.layers.Parallel): The Siamese model.
        TripletLoss (function): Function that builds the TripletLoss loss layer.
        lr_schedule (function): Trax learning-rate schedule function.
        train_generator (generator, optional): Training generator. Defaults to train_generator.
        val_generator (generator, optional): Validation generator. Defaults to validation_generator.
        output_dir (str, optional): Path to save the model to. Defaults to '~/models/siamese_networks/'.
        steps_per_checkpoint (int, optional): Steps between checkpoints. Defaults to 100.

    Returns:
        trax.supervised.training.Loop: Training loop for the model.
    """
    output_dir = Path(output_dir).expanduser()

    train_task = trax.supervised.training.TrainTask(
        labeled_data=train_generator,  # use the training generator
        loss_layer=TripletLoss(),      # use triplet loss (don't forget to instantiate it)
        optimizer=trax.optimizers.Adam(0.01),
        lr_schedule=lr_schedule,       # use the Trax schedule function
        n_steps_per_checkpoint=steps_per_checkpoint,
    )

    eval_task = trax.supervised.training.EvalTask(
        labeled_data=val_generator,    # use the validation generator
        metrics=[TripletLoss()],       # use triplet loss (don't forget to instantiate it)
    )

    training_loop = trax.supervised.training.Loop(Siamese,
                                                  train_task,
                                                  eval_tasks=[eval_task],
                                                  output_dir=output_dir)
    return training_loop


### Training

#### Trial Two

Note: I re-ran this next code block so it's actually the second run.

train_steps = 2000
training_loop = train_model(siamese.model, triplet_loss_layer, lr_schedule,
                            steps_per_checkpoint=5)

real_stdout = sys.stdout

TIMER.emit = False
TIMER.start()
with TemporaryFile("w") as temp_file:
    sys.stdout = temp_file
    training_loop.run(train_steps)
    TIMER.stop()
sys.stdout = real_stdout
print(f"{TIMER.ended - TIMER.started}")

0:19:46.056057

for mode in training_loop.history.modes:
    print(mode)
    print(training_loop.history.metrics_for_mode(mode))

eval
['metrics/TripletLoss']
train
['metrics/TripletLoss', 'training/gradients_l2', 'training/learning_rate', 'training/loss', 'training/steps per second', 'training/weights_l2']

• Plotting the Metrics

Note: As of February 2021, the version of trax on pypi doesn't have a history attribute - to get it you have to install the code from the github repository.

frame = pandas.DataFrame(training_loop.history.get("eval", "metrics/TripletLoss"),
                         columns="Batch TripletLoss".split())

minimum = frame.loc[frame.TripletLoss.idxmin()]
vline = holoviews.VLine(minimum.Batch).opts(opts.VLine(color=PLOT.red))
hline = holoviews.HLine(minimum.TripletLoss).opts(opts.HLine(color=PLOT.red))
line = frame.hvplot(x="Batch", y="TripletLoss").opts(opts.Curve(color=PLOT.blue))

plot = (line * hline * vline).opts(
    width=PLOT.width, height=PLOT.height,
    title="Evaluation Batch Triplet Loss",
)
output = Embed(plot=plot, file_name="evaluation_triplet_loss")()

print(output)


It looks like the loss is stabilizing. If it doesn't perform well I'll re-train it.

#### Trial Three

Let's see if the loss continues going down.

train_steps = 2000
training_loop = train_model(siamese.model, triplet_loss_layer, lr_schedule,
                            steps_per_checkpoint=5)

real_stdout = sys.stdout

TIMER.emit = False
TIMER.start()
with TemporaryFile("w") as temp_file:
    sys.stdout = temp_file
    training_loop.run(train_steps)
    TIMER.stop()
sys.stdout = real_stdout
print(f"{TIMER.ended - TIMER.started}")

0:17:41.167719

• Plotting the Metrics

frame = pandas.DataFrame(
    training_loop.history.get("eval", "metrics/TripletLoss"),
    columns="Batch TripletLoss".split())

minimum = frame.loc[frame.TripletLoss.idxmin()]
vline = holoviews.VLine(minimum.Batch).opts(opts.VLine(color=PLOT.red))
hline = holoviews.HLine(minimum.TripletLoss).opts(opts.HLine(color=PLOT.red))
line = frame.hvplot(x="Batch", y="TripletLoss").opts(opts.Curve(color=PLOT.blue))

plot = (line * hline * vline).opts(
    width=PLOT.width, height=PLOT.height,
    title="Evaluation Batch Triplet Loss (Third Run)",
)
output = Embed(plot=plot, file_name="evaluation_triplet_loss_third")()

print(output)


It looks like it stopped improving. Probably time to stop.

# Siamese Networks: Hard Negative Mining

## Hard Negative Mining

We will now implement the TripletLoss. The loss is composed of two terms: one term utilizes the mean of all the non-duplicates, the second utilizes the closest negative. Our loss expression is then:

\begin{align}
\mathcal{L}_1(A,P,N) &= \max\left(-\cos(A,P) + \text{mean}_{\text{neg}} + \alpha, 0\right)\\
\mathcal{L}_2(A,P,N) &= \max\left(-\cos(A,P) + \text{closest}_{\text{neg}} + \alpha, 0\right)\\
\mathcal{L}(A,P,N) &= \text{mean}(\mathcal{L}_1 + \mathcal{L}_2)
\end{align}

Here is a list of things we have to do:

• As this will be run inside trax, use fastnp.xyz when using any xyz numpy function
• Use fastnp.dot to calculate the similarity matrix $v_1v_2^T$ of dimension batch_size x batch_size
• Take the score of the duplicates on the diagonal fastnp.diagonal
• Use the trax functions fastnp.eye and fastnp.maximum for the identity matrix and the maximum.

### Imports

# python
from functools import partial

# pypi
from trax.fastmath import numpy as fastnp
from trax import layers

import jax
import numpy


## Implementation

### More Detailed Instructions

We'll describe the algorithm using a detailed example. Below, V1 and V2 are the outputs of the normalization blocks in our model. Here we will use a batch_size of 4 and a d_model of 3. The inputs, Q1 and Q2, are arranged so that corresponding inputs are duplicates while non-corresponding entries are not. The outputs will have the same pattern.

This testcase arranges the outputs, v1 and v2, to highlight different scenarios. Here, the first outputs V1[0] and V2[0] match exactly, so the model is generating the same vector for the Q1[0] and Q2[0] inputs. The second outputs differ: V2[1] is set to match V2[2], simulating a model which is generating very poor results. V1[2] and V2[2] match exactly again, while V1[3] and V2[3] are set to be exactly wrong - 180 degrees from each other.

#### Cosine Similarity

The first step is to compute the cosine similarity matrix or score in the code. This is $V_1 V_2^T$ which is generated with fastnp.dot.

The clever arrangement of inputs creates the data needed for positive and negative examples without having to run all pair-wise combinations. Because Q1[n] is a duplicate of only Q2[n], the other combinations are explicitly created negative examples, or Hard Negative examples. The matrix multiplication efficiently produces the cosine similarity of all positive/negative combinations. 'Positive' entries are the results of duplicate examples and 'negative' entries are the results of the explicitly created negative examples. The results for our test case are as expected: V1[0] and V2[0] match, producing a score of 1, while our other 'positive' cases don't match well, as was arranged. V2[2] was set to match V1[3], producing a poor match at score[2,2] and an undesired 'negative' score of 1.

With the similarity matrix (score) we can begin to implement the loss equations. First, we can extract $\cos(A,P)$ by utilizing fastnp.diagonal. This is positive in the code.
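A tiny numpy version of this step, using made-up, already-normalized vectors for a batch of two questions:

```python
import numpy

# hypothetical normalized model outputs for a batch of two question pairs
v1 = numpy.array([[1.0, 0.0],
                  [0.0, 1.0]])
v2 = numpy.array([[1.0, 0.0],
                  [0.6, 0.8]])

score = numpy.dot(v1, v2.T)        # all pairwise cosine similarities
positive = numpy.diagonal(score)   # cos(A, P) for each duplicate pair
print(score)
print(positive)
```

The diagonal holds the duplicate-pair similarities; everything off the diagonal is a negative example.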

#### Closest Negative

Next, we will create the closest_negative. This is the non-duplicate entry in V2 that is closest (has the largest cosine similarity) to an entry in V1. Each row, n, of score represents all comparisons of the results of Q1[n] vs Q2[x] within a batch. A specific example in our testcase is row score[2,:]. It has the cosine similarity of V1[2] and V2[x]. The closest_negative, as was arranged, is V2[2], which has a score of 1. This is the maximum value of the 'negative' entries.

To implement this, we need to pick the maximum entry on a row of score, ignoring the 'positive' entries. To avoid selecting them, we can make them large negative numbers: multiply fastnp.eye(batch_size) by 2.0 and subtract it out of scores. The result is negative_without_positive. Now we can use fastnp.max, row by row (axis=1), to select the maximum, which is closest_negative.
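Sketching that masking trick with a small, made-up score matrix:

```python
import numpy

# hypothetical 3x3 cosine-similarity matrix; the diagonal holds the positives
scores = numpy.array([[1.0, 0.2, 0.3],
                      [0.4, 0.9, 0.1],
                      [0.5, 0.6, 0.8]])
batch_size = len(scores)

# push the diagonal ('positive') entries below any possible cosine value
negative_without_positive = scores - 2.0 * numpy.eye(batch_size)
closest_negative = negative_without_positive.max(axis=1)
print(closest_negative)
```

Subtracting 2.0 works because cosine similarities live in [-1, 1], so a positive entry can never win the row-wise max after the shift.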

#### Mean Negative

Next, we'll create mean_negative. As the name suggests, this is the mean of all the 'negative' values in score on a row-by-row basis. We can use fastnp.eye(batch_size) and a constant, this time to create a mask with zeros on the diagonal. Element-wise multiply this with score to get just the 'negative' values. This is negative_zero_on_duplicate in the code. Compute the mean by using fastnp.sum on negative_zero_on_duplicate for axis=1 and dividing it by (batch_size - 1). This is mean_negative.
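And the mean-negative step on the same kind of made-up score matrix:

```python
import numpy

# hypothetical 3x3 cosine-similarity matrix; the diagonal holds the positives
scores = numpy.array([[1.0, 0.2, 0.3],
                      [0.4, 0.9, 0.1],
                      [0.5, 0.6, 0.8]])
batch_size = len(scores)

# zero out the diagonal so only the 'negative' entries contribute
negative_zero_on_duplicate = (1.0 - numpy.eye(batch_size)) * scores
mean_negative = negative_zero_on_duplicate.sum(axis=1) / (batch_size - 1)
print(mean_negative)
```

Dividing by batch_size - 1 rather than batch_size accounts for the zeroed-out diagonal entry in each row.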

Now, we can compute loss using the two equations above and fastnp.maximum. This will form triplet_loss1 and triplet_loss2.

triplet_loss is the fastnp.mean of the sum of the two individual losses.

def TripletLossFn(v1: numpy.ndarray, v2: numpy.ndarray,
                  margin: float=0.25) -> jax.interpreters.xla.DeviceArray:
    """Custom Loss function.

    Args:
        v1 (numpy.ndarray): Array with dimension (batch_size, model_dimension) associated to Q1.
        v2 (numpy.ndarray): Array with dimension (batch_size, model_dimension) associated to Q2.
        margin (float, optional): Desired margin. Defaults to 0.25.

    Returns:
        jax.interpreters.xla.DeviceArray: Triplet Loss.
    """
    # use fastnp to take the dot product of the two batches (don't forget to transpose the second argument)
    scores = fastnp.dot(v1, v2.T)
    # calculate the batch size
    batch_size = len(scores)
    # use fastnp to grab all positive (diagonal) entries in scores
    positive = fastnp.diagonal(scores)  # the positive ones (duplicates)
    # multiply fastnp.eye(batch_size) by 2.0 and subtract it out of scores
    negative_without_positive = scores - (fastnp.eye(batch_size) * 2.0)
    # take the row-by-row max of negative_without_positive
    closest_negative = fastnp.max(negative_without_positive, axis=1)
    # subtract fastnp.eye(batch_size) from 1.0 and do element-wise multiplication with scores
    negative_zero_on_duplicate = (1.0 - fastnp.eye(batch_size)) * scores
    # use fastnp.sum on negative_zero_on_duplicate for axis=1 and divide it by (batch_size - 1)
    mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1)/(batch_size - 1)
    # take the fastnp.maximum of 0.0 and (margin - positive + closest_negative)
    triplet_loss1 = fastnp.maximum(0, margin - positive + closest_negative)
    # take the fastnp.maximum of 0.0 and (margin - positive + mean_negative)
    triplet_loss2 = fastnp.maximum(0, (margin - positive) + mean_negative)
    # add the two losses together and take the fastnp.mean of it
    triplet_loss = fastnp.mean(triplet_loss1 + triplet_loss2)
    return triplet_loss

v1 = numpy.array([[0.26726124, 0.53452248, 0.80178373],[0.5178918 , 0.57543534, 0.63297887]])
v2 = numpy.array([[ 0.26726124,  0.53452248,  0.80178373],[-0.5178918 , -0.57543534, -0.63297887]])
triplet_loss = TripletLossFn(v2, v1)
print(f"Triplet Loss: {triplet_loss}")

assert triplet_loss == 0.5

Triplet Loss: 0.5


To make a layer out of a function with no trainable variables, use tl.Fn.

def TripletLoss(margin: float=0.25):
    triplet_loss_fn = partial(TripletLossFn, margin=margin)
    return layers.Fn('TripletLoss', triplet_loss_fn)


## Bundle It Up

Unfortunately trax does some kind of introspection where it counts the arguments of the things you use as layers, so class-based implementations won't work (because it counts the self argument, giving the layer one more argument than expected). There might be a way to work around this, but it doesn't appear to be documented, so this has to be done with plain functions. That's not bad, it's just unexpected (and not well documented).

### Imports

# python
from functools import partial

# from pypi
from trax.fastmath import numpy as fastmath_numpy
from trax import layers

import attr
import jax
import numpy
import trax


### Triplet Loss

def triplet_loss(v1: numpy.ndarray,
                 v2: numpy.ndarray,
                 margin: float=0.25) -> jax.interpreters.xla.DeviceArray:
    """Calculates the triplet loss

    Args:
        v1: normalized batch for question 1
        v2: normalized batch for question 2
        margin: desired margin for the loss

    Returns:
        triplet loss
    """
    scores = fastmath_numpy.dot(v1, v2.T)
    batch_size = len(scores)
    positive = fastmath_numpy.diagonal(scores)
    negative_without_positive = scores - (fastmath_numpy.eye(batch_size) * 2.0)
    closest_negative = fastmath_numpy.max(negative_without_positive, axis=1)
    negative_zero_on_duplicate = (1.0 - fastmath_numpy.eye(batch_size)) * scores
    mean_negative = fastmath_numpy.sum(negative_zero_on_duplicate, axis=1)/(batch_size - 1)
    triplet_loss1 = fastmath_numpy.maximum(0, margin - positive + closest_negative)
    triplet_loss2 = fastmath_numpy.maximum(0, (margin - positive) + mean_negative)
    return fastmath_numpy.mean(triplet_loss1 + triplet_loss2)


### Triplet Loss Layer

Another not-well-documented limitation is that the function you create the layer from isn't allowed to have default values, so if we want the margin to have a default, we have to use partial to set the value before creating the layer…

def triplet_loss_layer(margin: float=0.25) -> layers.Fn:
    """Converts the triplet_loss function to a trax layer"""
    with_margin = partial(triplet_loss, margin=margin)
    return layers.Fn("TripletLoss", with_margin)


### Check It Out

from neurotic.nlp.siamese_networks import triplet_loss_layer

layer = triplet_loss_layer()
print(type(layer))

<class 'trax.layers.base.PureLayer'>


# Siamese Networks: Defining the Model

## Understanding the Siamese Network

A Siamese network is a neural network which uses the same weights while working in tandem on two different input vectors to compute comparable output vectors.

You get the question embedding, run it through an LSTM layer, normalize $v_1$ and $v_2$, and finally use a triplet loss (explained below) to get the corresponding cosine similarity for each pair of questions. As usual, you will start by importing the data set. The triplet loss makes use of a baseline (anchor) input that is compared to a positive (truthy) input and a negative (falsy) input. The distance from the anchor to the positive input is minimized, and the distance from the anchor to the negative input is maximized. In math equations, you are trying to minimize the following:

$\mathcal{L}(A, P, N)=\max \left(\|\mathrm{f}(A)-\mathrm{f}(P)\|^{2}-\|\mathrm{f}(A)-\mathrm{f}(N)\|^{2}+\alpha, 0\right)$

$A$ is the anchor input, for example $q1_1$; $P$ is the duplicate input, for example $q2_1$; and $N$ is the negative input (the non-duplicate question), for example $q2_2$. $\alpha$ is a margin; you can think of it as a safety net, or how far you want to push the duplicates from the non-duplicates.
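As a concrete sketch of that equation, here it is on tiny, made-up vectors ($f(A)$, $f(P)$, $f(N)$ are just hand-picked arrays here, not the output of the real model):

```python
import numpy

def l2_triplet_loss(anchor: numpy.ndarray, positive: numpy.ndarray,
                    negative: numpy.ndarray, alpha: float=0.25) -> float:
    """The squared-distance triplet loss from the equation above."""
    positive_distance = numpy.sum((anchor - positive) ** 2)
    negative_distance = numpy.sum((anchor - negative) ** 2)
    return float(max(positive_distance - negative_distance + alpha, 0.0))

anchor = numpy.array([1.0, 0.0])
duplicate = numpy.array([0.9, 0.1])      # close to the anchor
non_duplicate = numpy.array([0.0, 1.0])  # far from the anchor

print(l2_triplet_loss(anchor, duplicate, non_duplicate))
```

When the positive is already much closer than the negative (by more than the margin), the loss clamps to zero and there is nothing left to learn from that triplet.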

### Imports

# from pypi
import trax.fastmath.numpy as fastnp
import trax.layers as tl

# this project
from neurotic.nlp.siamese_networks import DataLoader


### Set Up

loader = DataLoader()



## Implementation

To implement this model, you will be using trax. Concretely, you will be using the following functions.

• tl.Serial: Combinator that applies layers serially (by function composition); it allows you to set up the overall structure of the feedforward network. docs / source code
• You can pass in the layers as arguments to Serial, separated by commas.
• For example: tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))
• tl.Embedding: Maps discrete tokens to vectors. It will have shape (vocabulary length X dimension of output vectors). The dimension of the output vectors (also called d_feature) is the number of elements in the word embedding. docs / source code
• tl.Embedding(vocab_size, d_feature).
• vocab_size is the number of unique words in the given vocabulary.
• d_feature is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
• tl.LSTM: The LSTM layer. It leverages another Trax layer called LSTMCell. The number of units should be specified and should match the number of elements in the word embedding. docs / source code
• tl.LSTM(n_units) builds an LSTM layer of n_units.
• tl.Mean: Computes the mean across a desired axis. Mean uses one tensor axis to form groups of values and replaces each group with the mean value of that group. docs / source code
• tl.Mean(axis=1) takes the mean over columns.
• tl.Fn: Layer with no weights that applies a function f; we will use it for the normalization step. docs / source code
• tl.Fn('Normalize', lambda x: normalize(x)) returns a layer with no weights that applies the function f.
• tl.Parallel: A combinator layer (like Serial) that applies a list of layers in parallel to its inputs. docs / source code
def Siamese(vocab_size=len(loader.vocabulary), d_model=128, mode='train'):
    """Returns a Siamese model.

    Args:
        vocab_size (int, optional): Length of the vocabulary. Defaults to len(loader.vocabulary).
        d_model (int, optional): Depth of the model. Defaults to 128.
        mode (str, optional): 'train', 'eval' or 'predict', predict mode is for fast inference. Defaults to 'train'.

    Returns:
        trax.layers.combinators.Parallel: A Siamese model.
    """
    def normalize(x):  # normalizes the vectors to have L2 norm 1
        return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))

    q_processor = tl.Serial(  # processor will run on Q1 and Q2
        tl.Embedding(vocab_size, d_model),  # embedding layer
        tl.LSTM(d_model),                   # LSTM layer
        tl.Mean(axis=1),                    # mean over columns
        tl.Fn("Normalize", normalize),      # apply the normalize function
    )  # returns one vector of shape [batch_size, d_model]

    # run on Q1 and Q2 in parallel
    model = tl.Parallel(q_processor, q_processor)
    return model


### Check the Model

model = Siamese()
print(model)

Parallel_in2_out2[
  Serial[
    Embedding_77068_128
    LSTM_128
    Mean
    Normalize
  ]
  Serial[
    Embedding_77068_128
    LSTM_128
    Mean
    Normalize
  ]
]


## Bundle It Up

<<imports>>

<<constants>>

<<normalize>>

<<siamese-network>>

<<the-processor>>

<<the-model>>


### Imports

# python
from collections import namedtuple

# pypi
from trax import layers
from trax.fastmath import numpy as fastmath_numpy

import attr
import numpy
import trax


### Constants

Axis = namedtuple("Axis", ["columns", "last"])
Constants = namedtuple("Constants", ["model_depth", "axis"])

AXIS = Axis(1, -1)

CONSTANTS = Constants(128, AXIS)


### Normalize

def normalize(x: numpy.ndarray) -> numpy.ndarray:
    """Normalizes the vectors to have L2 norm 1

    Args:
        x: the array of vectors to normalize

    Returns:
        normalized version of x
    """
    return x/fastmath_numpy.sqrt(fastmath_numpy.sum(x**2,
                                                    axis=CONSTANTS.axis.last,
                                                    keepdims=True))
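A quick check of what this does, re-implemented with plain numpy so it runs standalone:

```python
import numpy

def normalize_rows(x: numpy.ndarray) -> numpy.ndarray:
    """Plain-numpy version of the normalize function above."""
    return x / numpy.sqrt(numpy.sum(x**2, axis=-1, keepdims=True))

v = normalize_rows(numpy.array([[3.0, 4.0],
                                [0.0, 2.0]]))
print(v)
print(numpy.linalg.norm(v, axis=-1))  # every row now has L2 norm 1
```

With unit-norm rows, the dot product of two vectors is exactly their cosine similarity, which is what the triplet loss relies on.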


### The Siamese Model

@attr.s(auto_attribs=True)
class SiameseModel:
    """The Siamese network model

    Args:
        vocabulary_size: number of tokens in the vocabulary
        model_depth: depth of our embedding layer
        mode: train|eval|predict
    """
    vocabulary_size: int
    model_depth: int=CONSTANTS.model_depth
    mode: str="train"
    _processor: trax.layers.combinators.Serial=None
    _model: trax.layers.combinators.Parallel=None


#### The Processor

@property
def processor(self) -> trax.layers.Serial:
"""The Question Processor"""
if self._processor is None:
self._processor = layers.Serial(
layers.Embedding(self.vocabulary_size, self.model_depth),
layers.LSTM(self.model_depth),
layers.Mean(axis=CONSTANTS.axis.columns),
layers.Fn("Normalize", normalize)
)
return self._processor


#### The Model

@property
def model(self) -> trax.layers.Parallel:
    """The Siamese Model"""
    if self._model is None:
        # re-use the processor property so both branches share the same layers
        self._model = layers.Parallel(self.processor, self.processor)
    return self._model


### Check It Out

from neurotic.nlp.siamese_networks import SiameseModel

model = SiameseModel(len(vocabulary))
print(model.model)

Parallel_in4_out2[
Serial_in2[
Embedding_77068_128
LSTM_128
Mean
Normalize_in2
]
Serial_in2[
Embedding_77068_128
LSTM_128
Mean
Normalize_in2
]
]


# Siamese Networks: The Data Generator

## Beginning

Most of the time in Natural Language Processing, and AI in general, we use batches when training our models. If you were to use stochastic gradient descent with one example at a time, it would take forever to build a model. In this example, we show how you can build a data generator that takes in $Q1$ and $Q2$ and returns a batch of size batch_size in the following format $([q1_1, q1_2, q1_3, \ldots], [q2_1, q2_2, q2_3, \ldots])$. The tuple consists of two arrays and each array has batch_size questions. Again, $q1_i$ and $q2_i$ are duplicates, but they are not duplicates of any other elements in the batch.

The iterator that we're going to create returns a pair of arrays of questions.

We'll implement the data generator below. Here are some things we will need.

• A while-true loop.
• If idx >= len_q, set idx to $0$.
• The generator should return shuffled batches of data. To achieve this without modifying the actual question lists, a list containing the indexes of the questions is created. This list can be shuffled and used to get random batches every time the index is reset.
• Append elements of $Q1$ and $Q2$ to input1 and input2 respectively.
• If len(input1) == batch_size, determine max_len as the length of the longest question in input1 and input2. Ceil max_len to a power of $2$ (for computation purposes) using the following command: max_len = 2**int(np.ceil(np.log2(max_len))).
• Pad every question with vocab['<PAD>'] until it reaches the length max_len.
• Use yield to return input1, input2.
• Don't forget to reset input1, input2 to empty lists at the end (the data generator resumes from where it last left off).
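The power-of-two ceiling from the instructions can be sanity-checked on its own; a longest question of 11 tokens, for instance, gets padded out to 16:

```python
import numpy as np

def ceil_to_power_of_two(max_len: int) -> int:
    """Round max_len up to the nearest power of 2"""
    return 2**int(np.ceil(np.log2(max_len)))

print(ceil_to_power_of_two(11))  # 16
print(ceil_to_power_of_two(16))  # 16
```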

### Imports

# python
import random

# pypi
import numpy

# this project
from neurotic.nlp.siamese_networks import DataLoader


### Set Up

#### Our Data

loader = DataLoader()
data = loader.data



#### The Idiotic Names

np = numpy
rnd = random


## Middle

def data_generator(Q1: list, Q2: list, batch_size: int,
                   pad: int=1, shuffle: bool=True):
    """Generator function that yields batches of data

    Args:
        Q1 (list): List of transformed (to tensor) questions.
        Q2 (list): List of transformed (to tensor) questions.
        batch_size (int): Number of elements per batch.
        pad (int, optional): Token used to pad the questions. Defaults to 1 (vocab['<PAD>']).
        shuffle (bool, optional): If the batches should be randomized or not. Defaults to True.

    Yields:
        tuple: Of the form (input1, input2) with types (numpy.ndarray, numpy.ndarray)
        NOTE: input1 holds [q1_1, q1_2, q1_3, ...] and input2 holds [q2_1, q2_2, q2_3, ...]
              where q1_i and q2_i are duplicates but q1_i and q2_j (i != j) are not
    """
    input1 = []
    input2 = []
    idx = 0
    len_q = len(Q1)
    question_indexes = list(range(len_q))

    if shuffle:
        rnd.shuffle(question_indexes)
    while True:
        if idx >= len_q:
            # we've gone through all the questions so start over
            idx = 0
            # shuffle to get random batches if shuffle is set to True
            if shuffle:
                rnd.shuffle(question_indexes)

        # get questions at the question_indexes[idx] position in Q1 and Q2
        q1 = Q1[question_indexes[idx]]
        q2 = Q2[question_indexes[idx]]

        idx += 1
        input1.append(q1)
        input2.append(q2)
        if len(input1) == batch_size:
            # determine max_len as the longest question in input1 & input2
            max_len = max(max(len(question) for question in input1),
                          max(len(question) for question in input2))
            print(max_len)
            # ceil max_len to a power of 2 (for computation purposes)
            max_len = 2**int(np.ceil(np.log2(max_len)))
            print(max_len)
            b1 = []
            b2 = []
            for q1, q2 in zip(input1, input2):
                # pad each question out to max_len
                q1 = q1 + ((max_len - len(q1)) * [pad])
                q2 = q2 + ((max_len - len(q2)) * [pad])
                b1.append(q1)
                b2.append(q2)
            yield np.array(b1), np.array(b2)
            # reset the batches
            input1, input2 = [], []


### Try It Out

rnd.seed(34)
batch_size = 2
generator = data_generator(data.train.question_one, data.train.question_two, batch_size)
result_1, result_2 = next(generator)
print(f"First questions  : \n{result_1}\n")
print(f"Second questions : \n{result_2}")

11
16
First questions  :
[[  34   37   13   50  536 1303 6428   25  924  157   28    1    1    1
1    1]
[  34   95  573 1444 2343   28    1    1    1    1    1    1    1    1
1    1]]

Second questions :
[[  34   37   13  575 1303 6428   25  924  157   28    1    1    1    1
1    1]
[   9  151   25  573 5642   28    1    1    1    1    1    1    1    1
1    1]]


## Bundling It Up

### Imports

# python
from collections import namedtuple

import random

# pypi
import attr
import numpy

# this project
from neurotic.nlp.siamese_networks import TOKENS


### The Data Generator

@attr.s(auto_attribs=True)
class DataGenerator:
"""Batch Generator for Quora question dataset

Args:
question_one: tensorized question 1
question_two: tensorized question 2
batch_size: size of generated batches
shuffle: whether to shuffle the questions around
"""
question_one: numpy.ndarray
question_two: numpy.ndarray
batch_size: int
shuffle: bool=True
_batch: iter=None


#### The Generator Definition

def data_generator(self):
    """Generator function that yields batches of data

    Yields:
        tuple: (batch_question_1, batch_question_2)
    """
    unpadded_1 = []
    unpadded_2 = []
    index = 0
    number_of_questions = len(self.question_one)
    question_indexes = list(range(number_of_questions))

    if self.shuffle:
        random.shuffle(question_indexes)

    while True:
        if index >= number_of_questions:
            index = 0
            if self.shuffle:
                random.shuffle(question_indexes)

        unpadded_1.append(self.question_one[question_indexes[index]])
        unpadded_2.append(self.question_two[question_indexes[index]])

        index += 1

        if len(unpadded_1) == self.batch_size:
            max_len = max(max(len(question) for question in unpadded_1),
                          max(len(question) for question in unpadded_2))
            # ceil to a power of 2 for computational reasons
            max_len = 2**int(numpy.ceil(numpy.log2(max_len)))
            padded_1 = []
            padded_2 = []
            for question_1, question_2 in zip(unpadded_1, unpadded_2):
                padded_1.append(
                    question_1 + ((max_len - len(question_1)) * [TOKENS.padding]))
                padded_2.append(
                    question_2 + ((max_len - len(question_2)) * [TOKENS.padding]))
            yield numpy.array(padded_1), numpy.array(padded_2)
            unpadded_1, unpadded_2 = [], []


#### The Generator

@property
def batch(self):
"""The generator instance"""
if self._batch is None:
self._batch = self.data_generator()
return self._batch


#### The Iter Method

def __iter__(self):
return self


#### The Next Method

def __next__(self):
return next(self.batch)


### Check It Out

from neurotic.nlp.siamese_networks import DataGenerator, DataLoader

loader = DataLoader()
data = loader.data

generator = DataGenerator(data.train.question_one, data.train.question_two, batch_size=2)

random.seed(34)
batch_size = 2
result_1, result_2 = next(generator)
print(f"First questions  : \n{result_1}\n")
print(f"Second questions : \n{result_2}")

First questions  :
[[  34   37   13   50  536 1303 6428   25  924  157   28    1    1    1
1    1]
[  34   95  573 1444 2343   28    1    1    1    1    1    1    1    1
1    1]]

Second questions :
[[  34   37   13  575 1303 6428   25  924  157   28    1    1    1    1
1    1]
[   9  151   25  573 5642   28    1    1    1    1    1    1    1    1
1    1]]


# Siamese Networks: The Data

## Transforming the Data

We'll be using the Quora question-pairs dataset to build a model that can identify similar questions. This is a useful task because you don't want to have several versions of the same question posted. Several times when teaching I end up responding to similar questions on piazza, or on other community forums. This data set has been labeled for us. Run the cell below to import some of the packages we will be using.

### Imports

# python
from collections import defaultdict
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from expects import expect, contain_exactly

import nltk
import numpy
import pandas

# my other stuff
from graeae import Timer


### Set Up

#### The Timer

TIMER = Timer()


#### NLTK

We need to download the punkt data to be able to tokenize our sentences.

nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     /home/neurotic/data/datasets/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### The Training Data

load_dotenv("posts/nlp/.env")
path = Path(os.environ["QUORA_TRAIN"]).expanduser()
data = pandas.read_csv(path)


## Middle

### Inspecting the Data

rows, columns = data.shape
print(f"Rows: {rows:,} Columns: {columns}")

Rows: 404,290 Columns: 6

print(data.iloc[0])

id                                                              0
qid1                                                            1
qid2                                                            2
question1       What is the step by step guide to invest in sh...
question2       What is the step by step guide to invest in sh...
is_duplicate                                                    0
Name: 0, dtype: object


So, you can see that we have a row ID, followed by IDs for each of the questions, followed by the question-pair, and finally a label of whether the two questions are duplicates (1) or not (0).

### Train Test Split

For the moment we're going to use a straight splitting of the dataset, rather than using a shuffled split. We're going for a roughly 75-25 split.

training_size = 3 * 10**5
training_data = data.iloc[:training_size]
testing_data = data.iloc[training_size:]

assert len(training_data) == training_size


Since the data set is large, we'll delete the original pandas DataFrame to save memory.

del(data)


### Filtering Out Non-Duplicates

We are going to use only the question pairs that are duplicate to train the model.

We build two batches as input for the Siamese network and we assume that question $q1_i$ (question i in the first batch) is a duplicate of $q2_i$ (question i in the second batch), but all other questions in the second batch are not duplicates of $q1_i$.

The test set uses the original pairs of questions and the status describing if the questions are duplicates.

duplicates = training_data[training_data.is_duplicate==1]
example = duplicates.iloc[0]
print(example.question1)
print(example.question2)
print(example.is_duplicate)

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?
1

print(f"There are {len(duplicates):,} duplicates for the training data.")

There are 111,473 duplicates for the training data.


We only took the duplicated questions for training our model because the data generator will produce batches $([q1_1, q1_2, q1_3, \ldots], [q2_1, q2_2, q2_3, \ldots])$ where $q1_i$ and $q2_k$ are duplicates if and only if $i = k$.

### Encoding the Words

Now we'll encode each word of the selected duplicate pairs with an index. Given a question, we can then just encode it as a list of numbers.

First we'll tokenize the questions using nltk.word_tokenize.

We'll also need a python default dictionary which later, during inference, assigns the value 0 to all Out Of Vocabulary (OOV) words.

#### Build the Vocabulary

We'll start by resetting the index. Pandas preserves the original index, but since we dropped the non-duplicates it's missing rows so resetting it will start it at 0 again. By default it normally keeps the original index as a column, but passing in drop=True prevents that.

reindexed = duplicates.reset_index(drop=True)


Now we'll build the vocabulary by mapping the words to the "index" for that word in the dictionary.

vocabulary = defaultdict(lambda: 0)
vocabulary['<PAD>'] = 1

with TIMER:
question_1_train = reindexed.question1.apply(nltk.word_tokenize)
question_2_train = reindexed.question2.apply(nltk.word_tokenize)
combined = question_1_train + question_2_train
for index, tokens in combined.iteritems():
tokens = (token for token in set(tokens) if token not in vocabulary)
for token in tokens:
vocabulary[token] = len(vocabulary) + 1
print(f"There are {len(vocabulary):,} words in the vocabulary.")

Started: 2021-01-30 18:36:26.773827
Ended: 2021-01-30 18:36:46.522680
Elapsed: 0:00:19.748853
There are 36,278 words in the vocabulary.


Some example vocabulary words.

print(vocabulary['<PAD>'])
print(vocabulary['Astrology'])
print(vocabulary['Astronomy'])

1
7
0


The last 0 indicates that, while Astrology is in our vocabulary, Astronomy is not. Peculiar.
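The OOV behavior falls out of the defaultdict: any word never assigned an index during the build maps to 0. A minimal sketch of the scheme (toy tokens, with 1 reserved for <PAD> as above):

```python
from collections import defaultdict

# toy version of the vocabulary build: 0 is reserved for OOV words, 1 for padding
vocabulary = defaultdict(lambda: 0)
vocabulary["<PAD>"] = 1
for token in ["Astrology", "is", "fun"]:
    vocabulary[token] = len(vocabulary) + 1

print(vocabulary["Astrology"])  # 2
print(vocabulary["Astronomy"])  # 0 (out of vocabulary)
```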

Now we'll set up the test arrays. One of the Question 1 entries is empty so we'll have to drop it first.

testing_data = testing_data[~testing_data.question1.isna()]

with TIMER:
Q1_test_words = testing_data.question1.apply(nltk.word_tokenize)
Q2_test_words = testing_data.question2.apply(nltk.word_tokenize)

Started: 2021-01-30 16:43:08.891230
Ended: 2021-01-30 16:43:27.954422
Elapsed: 0:00:19.063192


### Converting a question to a tensor

We'll now convert every question to a tensor, or an array of numbers, using the vocabulary we built above.

def words_to_index(words):
return [vocabulary[word] for word in words]

Q1_train = question_1_train.apply(words_to_index)
Q2_train = question_2_train.apply(words_to_index)

Q1_test = Q1_test_words.apply(words_to_index)
Q2_test = Q2_test_words.apply(words_to_index)

print('first question in the train set:\n')
print(question_1_train.iloc[0], '\n')
print('encoded version:')
print(Q1_train.iloc[0],'\n')

first question in the train set:

['Astrology', ':', 'I', 'am', 'a', 'Capricorn', 'Sun', 'Cap', 'moon', 'and', 'cap', 'rising', '...', 'what', 'does', 'that', 'say', 'about', 'me', '?']

encoded version:
[7, 6, 17, 26, 22, 12, 15, 14, 2, 24, 16, 19, 31, 8, 9, 21, 25, 3, 23, 29]


print(f"{len(vocabulary):,}")

77,068


### Validation Set

You will now split your train set into a training/validation set so that you can use it to train and evaluate your Siamese model.

TRAINING_FRACTION = 0.8
cut_off = int(len(question_1_train) * TRAINING_FRACTION)
train_question_1, train_question_2 = Q1_train[:cut_off], Q2_train[:cut_off]
validation_question_1, validation_question_2 = Q1_train[cut_off: ], Q2_train[cut_off:]
print(f"Number of duplicate questions: {len(Q1_train):,}")
print(f"The length of the training set is:  {len(train_question_1):,}")
print(f"The length of the validation set is: {len(validation_question_1):,}")

Number of duplicate questions: 111,473
The length of the training set is:  89,178
The length of the validation set is: 22,295


## Bundling It Up

### Imports

# python
from collections import defaultdict, namedtuple
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv

import attr
import nltk
import pandas


### NLTK Setup

nltk.download("punkt")


### Constants and Data

Tokens = namedtuple("Tokens", ["unknown", "padding", "padding_token"])
TOKENS = Tokens(unknown=0,
                padding=1,
                padding_token="<PAD>")

Question = namedtuple("Question", ["question_one", "question_two"])
Data = namedtuple("Data", ["train", "validate", "test", "y_test"])


### The Data Tokenizer

@attr.s(auto_attribs=True)
class DataTokenizer:
"""Converts questions to tokens

Args:
data: the data-frame to tokenize
"""
data: pandas.DataFrame
_question_1: pandas.Series=None
_question_2: pandas.Series=None


#### Question 1

@property
def question_1(self) -> pandas.Series:
"""tokenized version of question 1"""
if self._question_1 is None:
self._question_1 = self.data.question1.apply(nltk.word_tokenize)
return self._question_1


#### Question 2

@property
def question_2(self) -> pandas.Series:
"""tokenized version of question 2"""
if self._question_2 is None:
self._question_2 = self.data.question2.apply(nltk.word_tokenize)
return self._question_2


### The Data Tensorizer

@attr.s(auto_attribs=True)
class DataTensorizer:
"""Convert tokenized words to numbers

Args:
vocabulary: word to integer mapping
question_1: data to convert
question_2: other data to convert
"""
vocabulary: dict
question_1: pandas.Series
question_2: pandas.Series
_tensorized_1: pandas.Series=None
_tensorized_2: pandas.Series=None


#### Tensorized 1

@property
def tensorized_1(self) -> pandas.Series:
"""numeric version of question 1"""
if self._tensorized_1 is None:
self._tensorized_1 = self.question_1.apply(self.to_index)
return self._tensorized_1


#### Tensorized 2

@property
def tensorized_2(self) -> pandas.Series:
"""Numeric version of question 2"""
if self._tensorized_2 is None:
self._tensorized_2 = self.question_2.apply(self.to_index)
return self._tensorized_2


#### To Index

def to_index(self, words: list) -> list:
"""Convert list of words to list of integers"""
return [self.vocabulary[word] for word in words]


### The Data Transformer

@attr.s(auto_attribs=True)
class DataLoader:
    """Loads and transforms the data

    Args:
        env: The path to the .env file with the raw-data path
        key: key in the environment with the path to the data
        train_validation_size: number of entries for the training/validation set
        training_fraction: what fraction of the training/validation set for training
    """
    env: str="posts/nlp/.env"
    key: str="QUORA_TRAIN"
    train_validation_size: int=300000
    training_fraction: float=0.8
    _data_path: Path=None
    _raw_data: pandas.DataFrame=None
    _training_data: pandas.DataFrame=None
    _testing_data: pandas.DataFrame=None
    _duplicates: pandas.DataFrame=None
    _tokenized_train: DataTokenizer=None
    _tokenized_test: DataTokenizer=None
    _vocabulary: dict=None
    _tensorized_train: DataTensorizer=None
    _tensorized_test: DataTensorizer=None
    _test_labels: pandas.Series=None
    _data: namedtuple=None


#### Data Path

@property
def data_path(self) -> Path:
    """Where to find the data file"""
    if self._data_path is None:
        load_dotenv(self.env)
        self._data_path = Path(os.environ[self.key]).expanduser()
    return self._data_path


#### Data

@property
def raw_data(self) -> pandas.DataFrame:
    """The raw-data"""
    if self._raw_data is None:
        self._raw_data = pandas.read_csv(self.data_path)
        # drop rows with missing questions
        self._raw_data = self._raw_data[~self._raw_data.question1.isna()]
        self._raw_data = self._raw_data[~self._raw_data.question2.isna()]
    return self._raw_data


#### Training Data

@property
def training_data(self) -> pandas.DataFrame:
"""The training/validation part of the data"""
if self._training_data is None:
self._training_data = self.raw_data.iloc[:self.train_validation_size]
return self._training_data


#### Testing Data

@property
def testing_data(self) -> pandas.DataFrame:
"""The testing portion of the raw data"""
if self._testing_data is None:
self._testing_data = self.raw_data.iloc[self.train_validation_size:]
return self._testing_data


#### Duplicates

@property
def duplicates(self) -> pandas.DataFrame:
"""training-validation data that has duplicate questions"""
if self._duplicates is None:
self._duplicates = self.training_data[self.training_data.is_duplicate==1]
return self._duplicates


#### Train Tokenizer

@property
def tokenized_train(self) -> DataTokenizer:
"""training tokenized
"""
if self._tokenized_train is None:
self._tokenized_train = DataTokenizer(self.duplicates)
return self._tokenized_train


#### Test Tokenizer

@property
def tokenized_test(self) -> DataTokenizer:
"""Test Tokenizer"""
if self._tokenized_test is None:
self._tokenized_test = DataTokenizer(
self.testing_data)
return self._tokenized_test


#### The Vocabulary

@property
def vocabulary(self) -> dict:
"""The token:index map"""
if self._vocabulary is None:
self._vocabulary = defaultdict(lambda: TOKENS.unknown)
self._vocabulary[TOKENS.padding_token] = TOKENS.padding
combined = (self.tokenized_train.question_1
+ self.tokenized_train.question_2)
for index, tokens in combined.iteritems():
tokens = (token for token in set(tokens)
if token not in self._vocabulary)
for token in tokens:
self._vocabulary[token] = len(self._vocabulary) + 1
return self._vocabulary


#### Tensorized Train

@property
def tensorized_train(self) -> DataTensorizer:
"""Tensorizer for the training data"""
if self._tensorized_train is None:
self._tensorized_train = DataTensorizer(
vocabulary=self.vocabulary,
question_1 = self.tokenized_train.question_1,
question_2 = self.tokenized_train.question_2,
)
return self._tensorized_train


#### Tensorized Test

@property
def tensorized_test(self) -> DataTensorizer:
"""Tensorizer for the testing data"""
if self._tensorized_test is None:
self._tensorized_test = DataTensorizer(
vocabulary = self.vocabulary,
question_1 = self.tokenized_test.question_1,
question_2 = self.tokenized_test.question_2,
)
return self._tensorized_test


#### Test Labels

@property
def test_labels(self) -> pandas.Series:
"""The labels for the test data

0 : not duplicate questions
1 : is duplicate
"""
if self._test_labels is None:
self._test_labels = self.testing_data.is_duplicate
return self._test_labels


#### The Final Data

@property
def data(self) -> namedtuple:
"""The final tensorized data"""
if self._data is None:
cut_off = int(len(self.duplicates) * self.training_fraction)
self._data = Data(
train=Question(
question_one=self.tensorized_train.tensorized_1[:cut_off].to_numpy(),
question_two=self.tensorized_train.tensorized_2[:cut_off].to_numpy()),
validate=Question(
question_one=self.tensorized_train.tensorized_1[cut_off:].to_numpy(),
question_two=self.tensorized_train.tensorized_2[cut_off:].to_numpy()),
test=Question(
question_one=self.tensorized_test.tensorized_1.to_numpy(),
question_two=self.tensorized_test.tensorized_2.to_numpy()),
y_test=self.test_labels.to_numpy(),
)
return self._data


### Test It Out

from neurotic.nlp.siamese_networks import DataLoader

loader = DataLoader()
data = loader.data
print(f"Number of duplicate questions: {len(loader.duplicates):,}")
print(f"The length of the training set is:  {len(data.train.question_one):,}")
print(f"The length of the validation set is: {len(data.validate.question_one):,}")

Number of duplicate questions: 111,474
The length of the training set is:  89,179
The length of the validation set is: 22,295

print('first question in the train set:\n')
print('encoded version:')
print(data.train.question_one[0],'\n')
expect(data.train.question_one[0]).to(contain_exactly(*Q1_train.iloc[0]))

first question in the train set:

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
encoded version:
[7, 6, 17, 26, 22, 12, 15, 14, 2, 24, 16, 19, 31, 8, 9, 21, 25, 3, 23, 29]


assert len(loader.vocabulary) == len(vocabulary)
print(f"{len(loader.vocabulary):,}")

77,068


# Siamese Networks: Duplicate Questions

## Beginning

In this series of posts we will:

• Understand how the triplet loss works
• Understand how to evaluate accuracy
• Use cosine similarity between the model's outputted vectors
• Use the data generator to get batches of questions
• Make predictions using our own model

# Evaluating a Siamese Model

## Beginning

We are going to learn how to evaluate a Siamese model using the accuracy metric.

### Imports

# python
from pathlib import Path
import os

# from pypi
from dotenv import load_dotenv

import trax.fastmath.numpy as trax_numpy


### Set Up

load_dotenv("posts/nlp/.env")
PREFIX = "SIAMESE_"


## Middle

### Data

We're going to use some pre-made data rather than start from scratch to (hopefully) make the actual evaluation clearer.

These are the data structures:

• q1: vector with dimension (batch_size X max_length) containing first questions to compare in the test set.
• q2: vector with dimension (batch_size X max_length) containing second questions to compare in the test set.

Notice that for each pair of vectors within a batch $([q1_1, q1_2, q1_3, \ldots], [q2_1, q2_2, q2_3, \ldots])$, $q1_i$ is paired with $q2_i$.

• y_test: 1 if $q1_i$ and $q2_i$ are duplicates, 0 otherwise.
• v1: output vector from the model's prediction associated with the first questions.
• v2: output vector from the model's prediction associated with the second questions.

print(f'q1 has shape: {q1.shape} \n\nAnd it looks like this: \n\n {q1}\n\n')

q1 has shape: (512, 64)

And it looks like this:

[[ 32  38   4 ...   1   1   1]
[ 30 156  78 ...   1   1   1]
[ 32  38   4 ...   1   1   1]
...
[ 32  33   4 ...   1   1   1]
[ 30 156 317 ...   1   1   1]
[ 30 156   6 ...   1   1   1]]


The ones on the right side are padding values.

print(f'q2 has shape: {q2.shape} \n\nAnd looks like this: \n\n {q2}\n\n')

q2 has shape: (512, 64)

And looks like this:

[[   30   156    78 ...     1     1     1]
[  283   156    78 ...     1     1     1]
[   32    38     4 ...     1     1     1]
...
[   32    33     4 ...     1     1     1]
[   30   156    78 ...     1     1     1]
[   30   156 10596 ...     1     1     1]]

print(f'y_test has shape: {y_test.shape} \n\nAnd looks like this: \n\n {y_test}\n\n')

y_test has shape: (512,)

And looks like this:

[0 1 1 0 0 0 0 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0
0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0
0 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 1 0 1 1 0 0
0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0
1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 1 1 1
1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1
0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1
1 0 1 1 0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 1 1 1 0 0
0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0
0 0 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0
1 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 1
0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1
1 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

print(f'v1 has shape: {v1.shape} \n\nAnd looks like this: \n\n {v1}\n\n')

v1 has shape: (512, 128)

And looks like this:

[[ 0.01273625 -0.1496373  -0.01982759 ...  0.02205012 -0.00169148
-0.01598107]
[-0.05592084  0.05792497 -0.02226785 ...  0.08156938 -0.02570007
-0.00503111]
[ 0.05686752  0.0294889   0.04522024 ...  0.03141788 -0.08459651
-0.00968536]
...
[ 0.15115018  0.17791134  0.02200656 ... -0.00851707  0.00571415
-0.00431194]
[ 0.06995274  0.13110274  0.0202337  ... -0.00902792 -0.01221745
0.00505962]
[-0.16043712 -0.11899089 -0.15950686 ...  0.06544471 -0.01208312
-0.01183368]]

print(f'v2 has shape: {v2.shape} \n\nAnd looks like this: \n\n {v2}\n\n')

v2 has shape: (512, 128)

And looks like this:

[[ 0.07437647  0.02804951 -0.02974014 ...  0.02378932 -0.01696189
-0.01897198]
[ 0.03270066  0.15122835 -0.02175895 ...  0.00517202 -0.14617395
0.00204823]
[ 0.05635608  0.05454165  0.042222   ...  0.03831453 -0.05387777
-0.01447786]
...
[ 0.04727105 -0.06748016  0.04194937 ...  0.07600753 -0.03072828
0.00400715]
[ 0.00269269  0.15222628  0.01714724 ...  0.01482705 -0.0197884
0.01389528]
[-0.15475044 -0.15718803 -0.14732707 ...  0.04299919 -0.01070975
-0.01318042]]


### Calculating the accuracy

You will calculate the accuracy by iterating over the test set and checking if the model predicts right or wrong.

You will also need the batch size and the threshold that will determine if two questions are the same or not.

Note: A higher threshold means that only very similar questions will be considered as the same question.
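A toy illustration of the threshold's effect (the similarity values here are made up): raising it flips borderline pairs to "not duplicates".

```python
similarities = [0.95, 0.75, 0.65, 0.40]

# higher thresholds produce fewer positive (duplicate) predictions
for threshold in (0.5, 0.7, 0.9):
    predictions = [int(score > threshold) for score in similarities]
    print(threshold, predictions)
```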

batch_size = 512
threshold = 0.7
batch = range(batch_size)


The process is pretty straightforward:

• Iterate over each one of the elements in the batch
• Compute the cosine similarity between the predictions
• For computing the cosine similarity, the two output vectors should have been normalized using L2 normalization, meaning their magnitudes are 1. This has been taken care of by the Siamese network, so the cosine similarity here is just the dot product between the two vectors. You can check this by implementing the usual cosine similarity formula and verifying that it holds.
• Determine if this value is greater than the threshold (if it is, consider the two questions duplicates and return 1, else 0)
• Compare against the actual target and, if the prediction matches, increment the correct-prediction counter
• Divide the count of correct predictions by the number of processed elements to get the accuracy
correct = 0

for row in batch:
    similarity = trax_numpy.dot(v1[row], v2[row])
    similar_enough = similarity > threshold
    correct += (y_test[row] == similar_enough)

accuracy = correct / batch_size

print(f"The accuracy of the model is: {accuracy:0.4f}.")

The accuracy of the model is: 0.6621.
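Because the vectors are already L2-normalized, the loop can also be vectorized with row-wise dot products. A sketch on toy stand-ins for v1, v2, and y_test:

```python
import numpy

# toy stand-ins for the model outputs (unit-norm rows) and the labels
v1 = numpy.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
v2 = numpy.array([[0.9, 0.436], [1.0, 0.0], [0.8, 0.6]])
y_test = numpy.array([1, 1, 1])
threshold = 0.7

# row-wise dot products give the cosine similarities
similarities = (v1 * v2).sum(axis=1)
predictions = similarities > threshold
accuracy = (predictions == y_test).mean()  # 2 of the 3 predictions match
print(accuracy)
```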


# Modified Triplet Loss

## Beginning

We'll be looking at how to calculate the full triplet loss as well as a matrix of similarity scores.

### Background

This is the original triplet loss function:

$\mathcal{L_\mathrm{Original}} = \max{(\mathrm{s}(A,N) -\mathrm{s}(A,P) +\alpha, 0)}$

It can be improved by including the mean negative and the closest negative, to create a new full loss function. The inputs are the Anchor $\mathrm{A}$, Positive $\mathrm{P}$ and Negative $\mathrm{N}$.

\begin{align} \mathcal{L_\mathrm{1}} &= \max{(mean\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{2}} &= \max{(closest\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{Full}} &= \mathcal{L_\mathrm{1}} + \mathcal{L_\mathrm{2}}\\ \end{align}
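As a sketch of the arithmetic, given similarity scores (the margin value alpha=0.25 here is just an illustrative assumption):

```python
def full_triplet_loss(sim_ap: float, mean_neg: float,
                      closest_neg: float, alpha: float=0.25) -> float:
    """Sum the two hinge terms of the full triplet loss"""
    loss_1 = max(mean_neg - sim_ap + alpha, 0)
    loss_2 = max(closest_neg - sim_ap + alpha, 0)
    return loss_1 + loss_2

# a well-separated triplet incurs no loss
print(full_triplet_loss(sim_ap=0.9, mean_neg=0.1, closest_neg=0.4))  # 0
```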

### Imports

# from pypi
import numpy


## Middle

### Similarity Scores

The first step is to calculate the matrix of similarity scores using cosine similarity so that you can look up $\mathrm{s}(A,P)$, $\mathrm{s}(A,N)$ as needed for the loss formulas.

#### Two Vectors

First, this is how to calculate the similarity score, using cosine similarity, for 2 vectors.

$\mathrm{s}(v_1,v_2) = \mathrm{cosine \ similarity}(v_1,v_2) = \frac{v_1 \cdot v_2}{||v_1||~||v_2||}$

#### Similarity score

def cosine_similarity(v1: numpy.ndarray, v2: numpy.ndarray) -> float:
"""Calculates the cosine similarity between two vectors

Args:
v1: first vector
v2: vector to compare to v1

Returns:
the cosine similarity between v1 and v2
"""
numerator = numpy.dot(v1, v2)
denominator = numpy.sqrt(numpy.dot(v1, v1)) * numpy.sqrt(numpy.dot(v2, v2))
return numerator / denominator

• Similar vectors
v1 = numpy.array([1, 2, 3], dtype=float)
v2 = numpy.array([1, 2, 3.5])

print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")

cosine similarity : 0.9974

• Identical Vectors
v2 = v1
print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")

cosine similarity : 1.0000

• Opposite Vectors
v2 = -v1
print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")

cosine similarity : -1.0000

• Dissimilar Vectors
v2 = numpy.array([0,-42,1])
print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")

cosine similarity : -0.5153


### Two Batches of Vectors

Now let's look at how to calculate the similarity scores, using cosine similarity, for 2 batches of vectors. These are rows of individual vectors, just like in the example above, but stacked vertically into a matrix: for a batch size of 4 the matrix has 4 rows, one embedding vector per row.

The data is set up so that $v_{1\_1}$ and $v_{2\_1}$ represent duplicate inputs, but they are not duplicates of any other rows in the batch. This means $v_{1\_1}$ and $v_{2\_1}$ are more similar than, say, $v_{1\_1}$ and $v_{2\_2}$.

We'll use two different methods for calculating the matrix of similarities from 2 batches of vectors.

The Input data.

v1_1 = numpy.array([1, 2, 3])
v1_2 = numpy.array([9, 8, 7])
v1_3 = numpy.array([-1, -4, -2])
v1_4 = numpy.array([1, -7, 2])
v1 = numpy.vstack([v1_1, v1_2, v1_3, v1_4])
print("v1 :")
print(v1, "\n")
v2_1 = v1_1 + numpy.random.normal(0, 2, 3)  # add some noise to create approximate duplicate
v2_2 = v1_2 + numpy.random.normal(0, 2, 3)
v2_3 = v1_3 + numpy.random.normal(0, 2, 3)
v2_4 = v1_4 + numpy.random.normal(0, 2, 3)
v2 = numpy.vstack([v2_1, v2_2, v2_3, v2_4])
print("v2 :")
print(v2, "\n")

v1 :
[[ 1  2  3]
[ 9  8  7]
[-1 -4 -2]
[ 1 -7  2]]

v2 :
[[ 1.34263076  1.18510671  1.04373534]
[ 8.96692933  6.50763316  7.03243982]
[-3.4497247  -6.08808183 -4.54327564]
[-0.77144774 -9.08449817  4.4633513 ]]


For this to work, the batch sizes must match.

assert len(v1) == len(v2)


Now let's look at the similarity scores.

• Option 1 : nested loops and the cosine similarity function
batch_size, columns = v1.shape
scores_1 = numpy.zeros([batch_size, batch_size])

rows, columns = scores_1.shape

for row in range(rows):
    for column in range(columns):
        scores_1[row, column] = cosine_similarity(v1[row], v2[column])

print("Option 1 : Loop")
print(scores_1)

Option 1 : Loop
[[ 0.88245143  0.87735873 -0.93717609 -0.14613242]
[ 0.99999485  0.99567656 -0.95998199 -0.34214656]
[-0.86016573 -0.81584759  0.96484391  0.60584372]
[-0.31943701 -0.23354642  0.49063636  0.96181686]]

• Option 2 : Vector Normalization and the Dot Product
def norm(x: numpy.ndarray) -> numpy.ndarray:
    """Normalize the rows of x to unit length"""
    return x / numpy.sqrt(numpy.sum(x * x, axis=1, keepdims=True))

scores_2 = numpy.dot(norm(v1), norm(v2).T)

print("Option 2 : Vector Norm & dot product")
print(scores_2)

Option 2 : Vector Norm & dot product
[[ 0.88245143  0.87735873 -0.93717609 -0.14613242]
[ 0.99999485  0.99567656 -0.95998199 -0.34214656]
[-0.86016573 -0.81584759  0.96484391  0.60584372]
[-0.31943701 -0.23354642  0.49063636  0.96181686]]



#### Check

Let's make sure we get the same answer in both cases.

assert numpy.allclose(scores_1, scores_2)
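
A third equivalent route collapses the normalization and the dot product into a single einsum call. This is a sketch, not code from this project; the helper name batch_cosine_similarity is our choice:

```python
import numpy

def batch_cosine_similarity(v1: numpy.ndarray, v2: numpy.ndarray) -> numpy.ndarray:
    """All pairwise cosine similarities between rows of v1 and rows of v2."""
    unit_1 = v1 / numpy.linalg.norm(v1, axis=1, keepdims=True)
    unit_2 = v2 / numpy.linalg.norm(v2, axis=1, keepdims=True)
    # "ik,jk->ij": row i of unit_1 dotted with row j of unit_2
    return numpy.einsum("ik,jk->ij", unit_1, unit_2)
```

On the batches above this produces the same matrix as the loop and the norm-and-dot versions.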


### Hard Negative Mining

Now we'll calculate the mean negative $mean\_neg$ and the closest negative $closest\_neg$ used in calculating $\mathcal{L_\mathrm{1}}$ and $\mathcal{L_\mathrm{2}}$.

\begin{align} \mathcal{L_\mathrm{1}} &= \max{(mean\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{2}} &= \max{(closest\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \end{align}

We'll do this using the matrix of similarity scores for a batch size of 4. The diagonal of the matrix contains all the $\mathrm{s}(A,P)$ values, similarities from duplicate question pairs (aka Positives). This is an important attribute for the calculations to follow.

#### Mean Negative

mean_neg is the average of the off-diagonal values, the $\mathrm{s}(A,N)$ values, for each row.

#### Closest Negative

closest_neg is the largest off-diagonal value, $\mathrm{s}(A,N)$, that is smaller than the diagonal $\mathrm{s}(A,P)$ for each row.

similarity_scores = numpy.array(
[
[0.9, -0.8, 0.3, -0.5],
[-0.4, 0.5, 0.1, -0.1],
[0.3, 0.1, -0.4, -0.8],
[-0.5, -0.2, -0.7, 0.5],
]
)


#### Positives

All the s(A,P) values are similarities from duplicate question pairs (aka Positives). These are along the diagonal.

sim_ap = numpy.diag(similarity_scores)
print("s(A, P) :\n")
print(numpy.diag(sim_ap))  # re-expanded to a diagonal matrix for display

s(A, P) :

[[ 0.9  0.   0.   0. ]
[ 0.   0.5  0.   0. ]
[ 0.   0.  -0.4  0. ]
[ 0.   0.   0.   0.5]]


#### Negatives

All the s(A,N) values are similarities of the non-duplicate question pairs (aka Negatives). These are in the cells not on the diagonal.

sim_an = similarity_scores - numpy.diag(sim_ap)
print("s(A, N) :\n")
print(sim_an)

s(A, N) :

[[ 0.  -0.8  0.3 -0.5]
[-0.4  0.   0.1 -0.1]
[ 0.3  0.1  0.  -0.8]
[-0.5 -0.2 -0.7  0. ]]


#### Mean negative

This is the average of the s(A,N) values for each row.

batch_size = similarity_scores.shape[0]
mean_neg = numpy.sum(sim_an, axis=1, keepdims=True) / (batch_size - 1)
print("mean_neg :\n")
print(mean_neg)

mean_neg :

[[-0.33333333]
[-0.13333333]
[-0.13333333]
[-0.46666667]]
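
As a cross-check, the same means can be computed with an explicit loop over the off-diagonal entries of each row. This is a sketch against the similarity_scores matrix defined above:

```python
import numpy

similarity_scores = numpy.array(
    [
        [0.9, -0.8, 0.3, -0.5],
        [-0.4, 0.5, 0.1, -0.1],
        [0.3, 0.1, -0.4, -0.8],
        [-0.5, -0.2, -0.7, 0.5],
    ]
)
batch_size = similarity_scores.shape[0]

# average each row's off-diagonal (negative-pair) similarities
mean_neg_loop = numpy.array(
    [[sum(similarity_scores[row, col]
          for col in range(batch_size) if col != row) / (batch_size - 1)]
     for row in range(batch_size)]
)
print(mean_neg_loop)
```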


#### Closest negative

This is the largest $\mathrm{s}(A,N)$ value that is less than or equal to $\mathrm{s}(A,P)$ for each row.

mask_1 = numpy.identity(batch_size) == 1         # mask to exclude the diagonal
mask_2 = sim_an > sim_ap.reshape(batch_size, 1)  # mask to exclude sim_an > sim_ap
sim_an_masked = numpy.copy(sim_an)               # create a copy to preserve sim_an
sim_an_masked[mask_1 | mask_2] = -2              # -2 is below any possible cosine similarity
closest_neg = numpy.max(sim_an_masked, axis=1, keepdims=True)

print("Closest Negative :\n")
print(closest_neg)

Closest Negative :

[[ 0.3]
[ 0.1]
[-0.8]
[-0.2]]
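
The same values can be recovered with a plain loop, which makes the rule explicit: for each row, drop the diagonal, drop anything larger than $\mathrm{s}(A,P)$, and take the maximum of what remains. A sketch against the same matrix:

```python
import numpy

similarity_scores = numpy.array(
    [
        [0.9, -0.8, 0.3, -0.5],
        [-0.4, 0.5, 0.1, -0.1],
        [0.3, 0.1, -0.4, -0.8],
        [-0.5, -0.2, -0.7, 0.5],
    ]
)
sim_ap = numpy.diag(similarity_scores)
batch_size = similarity_scores.shape[0]

# per row: off-diagonal values no larger than s(A, P), then the max
closest_neg_loop = [
    max(similarity_scores[row, col]
        for col in range(batch_size)
        if col != row and similarity_scores[row, col] <= sim_ap[row])
    for row in range(batch_size)
]
print(closest_neg_loop)
```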


### The Loss Functions

The last step is to calculate the loss functions.

\begin{align} \mathcal{L_\mathrm{1}} &= \max{(mean\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{2}} &= \max{(closest\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\ \mathcal{L_\mathrm{Full}} &= \mathcal{L_\mathrm{1}} + \mathcal{L_\mathrm{2}}\\ \end{align}

The Alpha margin.

alpha = 0.25


#### Modified triplet loss

loss_1 = numpy.maximum(mean_neg - sim_ap.reshape(batch_size, 1) + alpha, 0)
loss_2 = numpy.maximum(closest_neg - sim_ap.reshape(batch_size, 1) + alpha, 0)
loss_full = loss_1 + loss_2


#### Cost

cost = numpy.sum(loss_full)
print("Loss Full :\n")
print(loss_full)
print(f"\ncost : {cost:.3f}")

Loss Full :

[[0.        ]
[0.        ]
[0.51666667]
[0.        ]]

cost : 0.517
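
All of the pieces above can be folded into a single function. This is a sketch consolidating the steps in this section, not code from this project; the name hard_negative_triplet_loss and the -2 sentinel are our choices:

```python
import numpy

def hard_negative_triplet_loss(scores: numpy.ndarray, alpha: float = 0.25) -> float:
    """Full triplet loss (L_1 + L_2) from a batch similarity matrix."""
    batch_size = scores.shape[0]
    sim_ap = numpy.diag(scores).reshape(batch_size, 1)  # s(A, P) on the diagonal
    sim_an = scores - numpy.diag(numpy.diag(scores))    # s(A, N): diagonal zeroed
    mean_neg = numpy.sum(sim_an, axis=1, keepdims=True) / (batch_size - 1)
    # exclude the diagonal and any negative more similar than the positive
    excluded = (numpy.identity(batch_size) == 1) | (sim_an > sim_ap)
    closest_neg = numpy.max(numpy.where(excluded, -2.0, sim_an),
                            axis=1, keepdims=True)
    loss_1 = numpy.maximum(mean_neg - sim_ap + alpha, 0)
    loss_2 = numpy.maximum(closest_neg - sim_ap + alpha, 0)
    return float(numpy.sum(loss_1 + loss_2))
```

Applied to the similarity_scores matrix from this section it reproduces the cost of 0.517.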