Siamese Networks: Evaluating the Model
Evaluating the Siamese Network
Force CPU Use
For some reason the model allocates more and more GPU memory during evaluation until it runs out - it looks like a memory leak. On top of that, for reasons I don't understand, the way TensorFlow tells you to disable the GPU (shown in the second code block below) doesn't work here, so to get this running I have to hide the GPU from CUDA entirely.
import os
# hide all CUDA devices so everything runs on the CPU
os.environ["CUDA_VISIBLE_DEVICES"] = ""
This is the way the TensorFlow documentation tells you to disable the GPU (it didn't work for me):
import tensorflow
tensorflow.config.set_visible_devices([], "GPU")
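As a quick sanity check (a minimal sketch, assuming only that TensorFlow is importable, and remembering that the environment variable has to be set before TensorFlow initializes CUDA), you can confirm that no GPUs are visible:
import tensorflow
# with CUDA_VISIBLE_DEVICES="" this should print an empty list
print(tensorflow.config.list_physical_devices("GPU"))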
Imports
# python
from collections import namedtuple
from collections.abc import Iterator
from pathlib import Path
# pypi
import numpy
import trax
# this project
from neurotic.nlp.siamese_networks import (
DataGenerator,
DataLoader,
SiameseModel,
)
# other
from graeae import Timer
Set Up
The Data
loader = DataLoader()
data = loader.data
vocabulary_length = len(loader.vocabulary)
y_test = data.y_test
testing = data.test
# free up memory - we only need the pieces extracted above
del loader
del data
The Timer
TIMER = Timer()
The Model
siamese = SiameseModel(vocabulary_length)
path = Path("~/models/siamese_networks/model.pkl.gz").expanduser()
weights = siamese.model.init_from_file(path, weights_only=True)
Classify
To determine the accuracy of the model we'll use the test set that was set up earlier. During training we used only positive examples, but the test data (Q1_test, Q2_test, and y_test) is set up as pairs of questions, some of which are duplicates and some of which are not.
This routine will run all the test question pairs through the model, compute the cosine similarity of each pair, threshold it, and compare the result to y_test (the correct label from the data set). The results are accumulated to produce an accuracy.
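As a toy illustration (the vectors here are made up, standing in for the model's outputs): since the output vectors are L2-normalized, the dot product of a pair is its cosine similarity, and thresholding that similarity gives the duplicate/not-duplicate prediction.
import numpy
# two made-up unit vectors playing the role of v1[j] and v2[j]
v1 = numpy.array([0.6, 0.8])
v2 = numpy.array([0.8, 0.6])
similarity = numpy.dot(v1, v2)
print(similarity)             # 0.96
print(int(similarity > 0.7))  # 1 - predicted to be duplicates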
Instructions
- Loop through the incoming data in batch_size chunks.
- Use the data generator to load q1 and q2 a batch at a time (don't forget to set shuffle=False!).
- Slice out the matching batch_size chunk of y as the labels for the batch.
- Compute v1 and v2 using the model.
- For each element of the batch:
- compute the cosine similarity of each pair of entries, v1[j], v2[j]
- determine whether the similarity exceeds the threshold
- increment the correct count if that result matches the expected label (y_test[j])
- Compute the final accuracy and return it.
Outcome = namedtuple("Outcome", ["accuracy", "true_positive",
                                 "true_negative", "false_positive",
                                 "false_negative"])
def classify(data_generator: Iterator,
             y: numpy.ndarray,
             threshold: float,
             model: trax.layers.Parallel) -> Outcome:
    """Function to test the accuracy of the model.

    Args:
     data_generator: batch generator for the test question pairs
     y: array of actual labels
     threshold: minimum cosine similarity for a pair to count as duplicates
     model: the Siamese model

    Returns:
     Outcome: the accuracy and the confusion-matrix counts
    """
    correct_count = 0
    true_positive = false_positive = true_negative = false_negative = 0
    batch_start = 0
    for batch_one, batch_two in data_generator:
        batch_size = len(batch_one)
        batch_stop = batch_start + batch_size
        if batch_stop > len(y):
            # the generator cycles forever, so stop once the next
            # batch would run past the end of the labels
            break
        batch_labels = y[batch_start: batch_stop]
        vector_one, vector_two = model((batch_one, batch_two))
        batch_start = batch_stop
        for row in range(batch_size):
            # the output vectors are L2-normalized, so their dot
            # product is the cosine similarity
            similarity = numpy.dot(vector_one[row], vector_two[row])
            same_question = int(similarity > threshold)
            correct = same_question == batch_labels[row]
            if same_question:
                if correct:
                    true_positive += 1
                else:
                    false_positive += 1
            else:
                if correct:
                    true_negative += 1
                else:
                    false_negative += 1
            correct_count += int(correct)
    return Outcome(accuracy=correct_count/len(y),
                   true_positive=true_positive,
                   true_negative=true_negative,
                   false_positive=false_positive,
                   false_negative=false_negative)
batch_size = 512
data_generator = DataGenerator(testing.question_one, testing.question_two,
batch_size=batch_size,
shuffle=False)
with TIMER:
outcome = classify(
data_generator=data_generator,
y=y_test,
threshold=0.7,
model=siamese.model
)
print(f"Outcome: {outcome}")
Started: 2021-02-10 21:42:27.320674
Ended: 2021-02-10 21:47:57.411380
Elapsed: 0:05:30.090706
Outcome: Outcome(accuracy=0.6546453536874203, true_positive=16439, true_negative=51832, false_positive=14425, false_negative=21240)
So, is that good or not? It might be more useful to look at the rates.
print(f"Accuracy: {outcome.accuracy:0.2f}")
true_positive = outcome.true_positive
false_negative = outcome.false_negative
true_negative = outcome.true_negative
false_positive = outcome.false_positive
print(f"True Positive Rate: {true_positive/(true_positive + false_negative): 0.2f}")
print(f"True Negative Rate: {true_negative/(true_negative + false_positive):0.2f}")
print(f"Precision: {outcome.true_positive/(true_positive + false_positive):0.2f}")
print(f"False Negative Rate: {false_negative/(false_negative + true_positive):0.2f}")
print(f"False Positive Rate: {false_positive/(false_positive + true_negative): 0.2f}")
Accuracy: 0.65
True Positive Rate: 0.44
True Negative Rate: 0.78
Precision: 0.53
False Negative Rate: 0.56
False Positive Rate: 0.22
So the model was better at recognizing questions that are different (a true negative rate of 0.78) than it was at catching actual duplicates (a true positive rate of 0.44). We could probably fiddle with the threshold to push it one way or the other if we needed to, as sketched below.
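Here's a minimal sketch of what that fiddling might look like, reusing classify, DataGenerator, and the model from above (the threshold values are arbitrary, and each pass over the test set took about five and a half minutes on the CPU, so this loop is slow):
for threshold in (0.5, 0.6, 0.7, 0.8):
    # the generator is consumed by classify, so rebuild it for each pass
    sweep_generator = DataGenerator(testing.question_one, testing.question_two,
                                    batch_size=batch_size,
                                    shuffle=False)
    sweep_outcome = classify(data_generator=sweep_generator,
                             y=y_test,
                             threshold=threshold,
                             model=siamese.model)
    print(f"threshold {threshold}: "
          f"accuracy={sweep_outcome.accuracy:0.2f}, "
          f"true positives={sweep_outcome.true_positive}, "
          f"false positives={sweep_outcome.false_positive}")
A lower threshold marks more pairs as duplicates, raising the true positive rate at the cost of more false positives; a higher threshold does the reverse.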