Siamese Networks: New Questions

Trying New Questions

Imports

# python
from pathlib import Path

# pypi
import nltk
import numpy
import pandas
import trax

# this project
from neurotic.nlp.siamese_networks import (
    DataGenerator,
    DataLoader,
    SiameseModel,
    TOKENS,
)

Set Up

The Data

data_generator = DataGenerator
loader = DataLoader()
vocabulary = loader.vocabulary

The Model

siamese = SiameseModel(len(vocabulary))
path = Path("~/models/siamese_networks/model.pkl.gz").expanduser()
weights = siamese.model.init_from_file(path, weights_only=True)
model = siamese.model

Implementing It

Write a function `predict` that takes in two questions, the model, and the vocabulary and returns whether the questions are duplicates (`True`) or not duplicates (`False`) given a similarity threshold.

Instructions:

  • Tokenize each question using `nltk.word_tokenize`
  • Create Q1, Q2 by encoding your questions as lists of numbers using vocab
  • Pad Q1, Q2 with next(data_generator([Q1], [Q2], 1, vocab['<PAD>']))
  • Use model() to create v1, v2
  • Compute the cosine similarity (dot product) of v1, v2
  • Compute the result by comparing the similarity to the threshold

def predict(question1: str, question2: str,
            threshold: float=0.7, model: trax.layers.Parallel=model,
            vocab: dict=vocabulary, data_generator: type=data_generator,
            verbose: bool=True) -> bool:
    """Function for predicting if two questions are duplicates.

    Args:
       question1 (str): First question.
       question2 (str): Second question.
       threshold (float): Desired threshold.
       model (trax.layers.combinators.Parallel): The Siamese model.
       vocab (collections.defaultdict): The vocabulary used.
       data_generator (function): Data generator function. Defaults to data_generator.
       verbose (bool, optional): If the results should be printed out. Defaults to True.

    Returns:
       bool: True if the questions are duplicates, False otherwise.
    """
    question_one = [[vocab[word] for word in nltk.word_tokenize(question1)]]
    question_two = [[vocab[word] for word in nltk.word_tokenize(question2)]]

    questions = next(data_generator(question_one,
                                    question_two,
                                    batch_size=1))
    vector_1, vector_2 = model(questions)
    similarity = float(numpy.dot(vector_1, vector_2.T))
    same_question = similarity > threshold

    if verbose:
        print(f"Q1  = {questions[0]}")
        print(f"Q2 = {questions[1]}")
        print(f"Similarity : {float(similarity):0.2f}")
        print(f"They are the same question: {same_question}")
    return same_question
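
The dot product only works as a cosine similarity if the vectors coming out of the model are already unit length, which I'm assuming the Siamese model takes care of (nothing in this post shows it). Here's a minimal sketch with made-up vectors, not the model's output, showing that the two agree once you normalize:

```python
import numpy


def normalize(vector: numpy.ndarray) -> numpy.ndarray:
    """Scale a vector to unit (L2) length."""
    return vector / numpy.linalg.norm(vector)


def cosine_similarity(first: numpy.ndarray, second: numpy.ndarray) -> float:
    """Cosine similarity computed the long way, for comparison."""
    norms = numpy.linalg.norm(first) * numpy.linalg.norm(second)
    return float(numpy.dot(first, second) / norms)


# made-up vectors standing in for the model's two outputs
v1 = numpy.array([3.0, 4.0, 0.0])
v2 = numpy.array([4.0, 3.0, 0.0])

raw_dot = float(numpy.dot(v1, v2))
dot_of_normalized = float(numpy.dot(normalize(v1), normalize(v2)))

print(f"raw dot product: {raw_dot:0.2f}")                      # 24.00
print(f"dot of normalized vectors: {dot_of_normalized:0.2f}")  # 0.96
print(f"cosine similarity: {cosine_similarity(v1, v2):0.2f}")  # 0.96
```

Without the normalization the raw dot product (24 here) isn't bounded by 1, so comparing it to a threshold like 0.7 wouldn't mean much.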

Some Trials

print(TOKENS)
Tokens(unknown=0, padding=1, padding_token='<PAD>')

So if we see a 0 in the tokens then we know the word wasn't in the vocabulary.
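
A small sketch of why that happens, assuming the vocabulary is a `defaultdict` that falls back to the unknown id (the docstring for `predict` says it's a `collections.defaultdict`, but the fallback is my assumption). The toy vocabulary only holds a few ids copied from the trial outputs in this post, and I'm using `str.split` on a pre-spaced string instead of `nltk.word_tokenize` to keep it self-contained:

```python
from collections import defaultdict

UNKNOWN = 0  # matches TOKENS.unknown

# toy vocabulary: any word that isn't in it maps to the unknown id
toy_vocabulary = defaultdict(lambda: UNKNOWN)
toy_vocabulary.update({
    "Do": 446,
    "cows": 5757,
    "from": 125,
    "have": 216,
    "butts": 25442,
    "?": 16,
})

tokens = "Do cows from Lancashire have butts ?".split()
encoded = [toy_vocabulary[token] for token in tokens]
print(encoded)  # [446, 5757, 125, 0, 216, 25442, 16]
```

"Lancashire" isn't in the toy vocabulary so it comes out as 0.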

question1 = "When will I see you?"
question2 = "When can I see you again?"
# True means the questions are duplicates, False otherwise
predict(question1, question2, 0.7, model, vocabulary, verbose=True)
Q1  = [[581  64  20  44  49  16   1   1]]
Q2 = [[ 581   39   20   44   49 7280   16    1]]
Similarity : 0.95
They are the same question: True
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"

predict(question1, question2, 0.7, model, vocabulary, verbose=True)
Q1  = [[  446  1138  3159  1169    70 29016    16     1]]
Q2 = [[  446  1138    57 15302    24    70  7430    16]]
Similarity : 0.60
They are the same question: False
predict("Do cows have butts?", "Do dogs have bones?")
Q1  = [[  446  5757   216 25442    16     1     1     1]]
Q2 = [[  446   788   216 11192    16     1     1     1]]
Similarity : 0.25
They are the same question: False
predict("Do cows from Lancashire have butts?", "Do dogs have bones as big as whales?")
Q1  = [[  446  5757   125     0   216 25442    16     1     1     1     1     1
      1     1     1     1]]
Q2 = [[  446   788   216 11192   249  1124   249 30836    16     1     1     1
      1     1     1     1]]
Similarity : 0.13
They are the same question: False
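
The padded lengths so far (8 and 16) look like the generator rounds the length of the longer question up to the next power of two and fills the rest with the padding id. That's my reading of the output, not something the `DataGenerator` code is shown doing here, so treat this as a sketch with a hypothetical `pad_pair` helper:

```python
PADDING = 1  # matches TOKENS.padding


def next_power_of_two(length: int) -> int:
    """Smallest power of two that is >= length (for length >= 1)."""
    return 1 << (length - 1).bit_length()


def pad_pair(first: list, second: list, padding: int = PADDING) -> tuple:
    """Pad both token lists to a shared power-of-two length."""
    target = next_power_of_two(max(len(first), len(second)))
    return (first + [padding] * (target - len(first)),
            second + [padding] * (target - len(second)))


# the ids from the Lancashire trial above (7 and 9 tokens)
q1 = [446, 5757, 125, 0, 216, 25442, 16]
q2 = [446, 788, 216, 11192, 249, 1124, 249, 30836, 16]

padded_1, padded_2 = pad_pair(q1, q2)
print(len(padded_1), len(padded_2))  # 16 16
```
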
predict("Can pigs fly?", "Are you my mother?")
Q1  = [[  221 14137  5750    16     1     1     1     1]]
Q2 = [[ 517   49   41 1585   16    1    1    1]]
Similarity : 0.01
They are the same question: False
predict("Shall we dance?", "Shall I fart?")
Q1  = [[19382   138  4201    16]]
Q2 = [[19382    20 18288    16]]
Similarity : 0.71
They are the same question: True

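Hm… it's surprising that "fart" was in the data set, and that the model counts it as the same question as the one about dancing (although at 0.71 it only just cleared the 0.7 threshold).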

farts = loader.training_data[loader.training_data.question2.str.contains("fart[^a-z]")]
print(len(farts))
print(farts.question2.head())
16
19820                                    Can penguins fart?
60745       How do I control a fart when I'm about to fart?
83124           What word square starts with the word fart?
96707         Which part of human body is called fart pump?
120727    Why do people fart more when they wake up in t...
Name: question2, dtype: object

Maybe I shouldn't have been surprised.
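
One caveat about that count: the pattern `fart[^a-z]` needs a character after the word, so it would skip a question that ends with "fart" and no punctuation. A word-boundary pattern is one alternative. Here's a sketch using plain `re` on made-up strings, not rows from the data set:

```python
import re

original = re.compile("fart[^a-z]")
bounded = re.compile(r"\bfart\b")

# made-up examples
questions = [
    "Can penguins fart?",
    "Why do people fart",
    "Is farting healthy?",
]

original_hits = [bool(original.search(question)) for question in questions]
bounded_hits = [bool(bounded.search(question)) for question in questions]

print(original_hits)  # [True, False, False]
print(bounded_hits)   # [True, True, False]
```

Since `Series.str.contains` treats its pattern as a regular expression by default, the same raw-string pattern should work there too.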

predict("Am I man or gorilla?", "Am I able to eat the pasta?")
Q1  = [[4311   20 1215   75 7438   16    1    1]]
Q2 = [[ 4311    20   461    37   922    70 14552    16]]
Similarity : 0.20
They are the same question: False

It looks like the model only looks at the first words… at least when the sentences are short.

predict("Will we return to Mars or go instead to Venus?", "Will we eat rice with plums and cherry topping?")
Q1  = [[  168   141  8303    34  6861    72  1315  4536    34 15555    16     1
      1     1     1     1]]
Q2 = [[  168   141   927  7612   121     0     9 19275     0    16     1     1
      1     1     1     1]]
Similarity : 0.67
They are the same question: False
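
Two of these trials landed close to the 0.7 cutoff (0.71 and 0.67), so the verdicts depend a fair amount on where you put it. This sketch reuses the similarities printed above (which are rounded to two places); the other two thresholds are arbitrary picks of mine:

```python
# similarities copied from the trial outputs in this post, in order
similarities = [0.95, 0.60, 0.25, 0.13, 0.01, 0.71, 0.20, 0.67]


def count_duplicates(scores: list, threshold: float) -> int:
    """Count the pairs that would be called duplicates at this threshold."""
    return sum(score > threshold for score in scores)


for threshold in (0.65, 0.7, 0.75):
    count = count_duplicates(similarities, threshold)
    print(f"threshold={threshold}: {count} duplicates")
```

Moving the threshold by 0.05 in either direction flips one of the eight verdicts.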

Siamese networks are useful for this kind of problem. On Quora and similar platforms many new questions have already been asked in some form, and you can use a Siamese network to catch them before they become duplicates.