Siamese Networks: The Data

Transforming the Data

We'll be using the Quora question pairs dataset to build a model that can identify similar questions. This is a useful task because you don't want to have several versions of the same question posted. When teaching, I often end up answering the same question asked in slightly different ways on Piazza or other community forums. This dataset has been labeled for you. Run the cell below to import some of the packages you will be using.

Imports

# python
from collections import defaultdict
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from expects import expect, contain_exactly

import nltk
import numpy
import pandas

# my other stuff
from graeae import Timer

Set Up

The Timer

TIMER = Timer()

NLTK

We need to download the punkt data to be able to tokenize our sentences.

nltk.download("punkt")
[nltk_data] Downloading package punkt to
[nltk_data]     /home/neurotic/data/datasets/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
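As a quick sanity check that the tokenizer we just downloaded works (the sentence here is just made up for illustration):

print(nltk.word_tokenize("Is this a duplicate question?"))
['Is', 'this', 'a', 'duplicate', 'question', '?']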

The Training Data

load_dotenv("posts/nlp/.env")
path = Path(os.environ["QUORA_TRAIN"]).expanduser()
data = pandas.read_csv(path)

Middle

Inspecting the Data

rows, columns = data.shape
print(f"Rows: {rows:,} Columns: {columns}")
Rows: 404,290 Columns: 6
print(data.iloc[0])
id                                                              0
qid1                                                            1
qid2                                                            2
question1       What is the step by step guide to invest in sh...
question2       What is the step by step guide to invest in sh...
is_duplicate                                                    0
Name: 0, dtype: object

So, you can see that we have a row ID, followed by IDs for each of the questions, followed by the question pair itself, and finally a label indicating whether the two questions are duplicates (1) or not (0).
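If you want to eyeball a few of the pairs next to their labels, one way (using the column names above) is to pull out just those columns:

print(data[["question1", "question2", "is_duplicate"]].head())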

Train Test Split

For the moment we're going to use a straight split of the dataset rather than a shuffled one, aiming for roughly a 75-25 train-test split.

training_size = 3 * 10**5
training_data = data.iloc[:training_size]
testing_data = data.iloc[training_size:]

assert len(training_data) == training_size
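
As a rough check on that "75-25" claim (re-using the rows count from above):

print(f"Training fraction: {training_size/rows:.2%}")
print(f"Testing fraction: {(rows - training_size)/rows:.2%}")
Training fraction: 74.20%
Testing fraction: 25.80%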

Since the data set is large, we'll delete the original pandas DataFrame to save memory.

del data

Filtering Out Non-Duplicates

We are going to use only the question pairs that are duplicates to train the model.

We build two batches as input for the Siamese network and we assume that question \(q1_i\) (question i in the first batch) is a duplicate of \(q2_i\) (question i in the second batch), but all other questions in the second batch are not duplicates of \(q1_i\).

The test set uses the original pairs of questions and the status describing if the questions are duplicates.

duplicates = training_data[training_data.is_duplicate==1]
example = duplicates.iloc[0]
print(example.question1)
print(example.question2)
print(example.is_duplicate)
Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?
1
print(f"There are {len(duplicates):,} duplicates for the training data.")
There are 111,473 duplicates for the training data.

We only took the duplicated questions for training our model because the data generator will produce batches \(([q1_1, q1_2, q1_3, \ldots], [q2_1, q2_2, q2_3, \ldots])\) where \(q1_i\) and \(q2_k\) are duplicates if and only if \(i = k\).
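
As a toy sketch of that assumption (the encodings here are made up, and this isn't the real data generator), the duplicate-label matrix implied by a pair of aligned batches is just the identity:

batch_size = 3
q1_batch = [[7, 6, 17], [5, 4, 3], [2, 8, 9]]   # pretend encodings for q1_1, q1_2, q1_3
q2_batch = [[9, 6, 17], [5, 4, 8], [2, 8, 1]]   # pretend encodings for q2_1, q2_2, q2_3

# q1_i and q2_k are treated as duplicates only when i == k
print(numpy.eye(batch_size))
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]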

Encoding the Words

Now we'll encode each word of the selected duplicate pairs with an index. Given a question, we can then just encode it as a list of numbers.

First we'll tokenize the questions using nltk.word_tokenize.

We'll also need a Python defaultdict which, later during inference, assigns the value 0 to all Out-Of-Vocabulary (OOV) words.
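
Here's a tiny illustration of that behavior with a made-up vocabulary (not the real one we build below): a defaultdict returns its default for missing keys, and the lookup also inserts the key as a side effect.

toy_vocabulary = defaultdict(lambda: 0)
toy_vocabulary["<PAD>"] = 1
toy_vocabulary["hello"] = 2
print(toy_vocabulary["hello"])
print(toy_vocabulary["goodbye"])
print(len(toy_vocabulary))
2
0
3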

Build the Vocabulary

We'll start by resetting the index. Pandas preserves the original index, but since we dropped the non-duplicates it's missing rows, so resetting it will start it at 0 again. By default reset_index keeps the old index as a new column, but passing in drop=True prevents that.

reindexed = duplicates.reset_index(drop=True)
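
As a quick, made-up example of what drop=True changes (not the real data):

frame = pandas.DataFrame(dict(question=["a", "b", "c"]), index=[0, 5, 9])
print(frame.reset_index().columns.tolist())
print(frame.reset_index(drop=True).columns.tolist())
['index', 'question']
['question']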

Now we'll build the vocabulary by mapping the words to the "index" for that word in the dictionary.

vocabulary = defaultdict(lambda: 0)
vocabulary['<PAD>'] = 1

with TIMER:
    question_1_train = reindexed.question1.apply(nltk.word_tokenize)
    question_2_train = reindexed.question2.apply(nltk.word_tokenize)
    combined = question_1_train + question_2_train
    for index, tokens in combined.iteritems():
        tokens = (token for token in set(tokens) if token not in vocabulary)
        for token in tokens:
            vocabulary[token] = len(vocabulary) + 1
print(f"There are {len(vocabulary):,} words in the vocabulary.")            
Started: 2021-01-30 18:36:26.773827
Ended: 2021-01-30 18:36:46.522680
Elapsed: 0:00:19.748853
There are 36,278 words in the vocabulary.

Some example vocabulary words.

print(vocabulary['<PAD>'])
print(vocabulary['Astrology'])
print(vocabulary['Astronomy'])
1
7
0

The last 0 indicates that, while Astrology is in our vocabulary, Astronomy is not. Peculiar.

Now we'll set up the test arrays. One of the question1 entries in the test set is empty, so we'll have to drop that row first.

testing_data = testing_data[~testing_data.question1.isna()]
with TIMER:
    Q1_test_words = testing_data.question1.apply(nltk.word_tokenize)
    Q2_test_words = testing_data.question2.apply(nltk.word_tokenize)
Started: 2021-01-30 16:43:08.891230
Ended: 2021-01-30 16:43:27.954422
Elapsed: 0:00:19.063192

Converting a Question to a Tensor

We'll now convert every question to a tensor, or an array of numbers, using the vocabulary we built above.

def words_to_index(words):
    return [vocabulary[word] for word in words]

Q1_train = question_1_train.apply(words_to_index)
Q2_train = question_2_train.apply(words_to_index)

Q1_test = Q1_test_words.apply(words_to_index)
Q2_test = Q2_test_words.apply(words_to_index)

print('first question in the train set:\n')
print(question_1_train.iloc[0], '\n') 
print('encoded version:')
print(Q1_train.iloc[0],'\n')
first question in the train set:

['Astrology', ':', 'I', 'am', 'a', 'Capricorn', 'Sun', 'Cap', 'moon', 'and', 'cap', 'rising', '...', 'what', 'does', 'that', 'say', 'about', 'me', '?'] 

encoded version:
[7, 6, 17, 26, 22, 12, 15, 14, 2, 24, 16, 19, 31, 8, 9, 21, 25, 3, 23, 29] 

print(f"{len(vocabulary):,}")
77,068

Validation Set

Now we'll split the training set into training and validation sets so that we can use them to train and evaluate the Siamese model.

TRAINING_FRACTION = 0.8
cut_off = int(len(question_1_train) * TRAINING_FRACTION)
train_question_1, train_question_2 = Q1_train[:cut_off], Q2_train[:cut_off]
validation_question_1, validation_question_2 = Q1_train[cut_off: ], Q2_train[cut_off:]
print(f"Number of duplicate questions: {len(Q1_train):,}")
print(f"The length of the training set is:  {len(train_question_1):,}")
print(f"The length of the validation set is: {len(validation_question_1):,}")
Number of duplicate questions: 111,473
The length of the training set is:  89,178
The length of the validation set is: 22,295

Bundling It Up

Imports

# python
from collections import defaultdict, namedtuple
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv

import attr
import nltk
import pandas

NLTK Setup

nltk.download("punkt")

Constants and Data

Tokens = namedtuple("Tokens", ["unknown", "padding", "padding_token"])
TOKENS = Tokens(unknown=0,
                padding=1,
                padding_token="<PAD>")

Question = namedtuple("Question", ["question_one", "question_two"])
Data = namedtuple("Data", ["train", "validate", "test", "y_test"])
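
As a small, made-up illustration of how these bundles will nest once the loader fills them in (the values here aren't real encodings):

example = Data(
    train=Question(question_one=[[7, 6]], question_two=[[9, 6]]),
    validate=Question(question_one=[], question_two=[]),
    test=Question(question_one=[], question_two=[]),
    y_test=[0],
)
print(example.train.question_one[0])
[7, 6]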

The Data Tokenizer

@attr.s(auto_attribs=True)
class DataTokenizer:
    """Converts questions to tokens

    Args:
     data: the data-frame to tokenize
    """
    data: pandas.DataFrame
    _question_1: pandas.Series=None
    _question_2: pandas.Series=None

Question 1

@property
def question_1(self) -> pandas.Series:
    """tokenized version of question 1"""
    if self._question_1 is None:
        self._question_1 = self.data.question1.apply(nltk.word_tokenize)
    return self._question_1

Question 2

@property
def question_2(self) -> pandas.Series:
    """tokenized version of question 2"""
    if self._question_2 is None:
        self._question_2 = self.data.question2.apply(nltk.word_tokenize)
    return self._question_2

The Data Tensorizer

@attr.s(auto_attribs=True)
class DataTensorizer:
    """Convert tokenized words to numbers

    Args:
     vocabulary: word to integer mapping
     question_1: data to convert
     question_2: other data to convert
    """
    vocabulary: dict
    question_1: pandas.Series
    question_2: pandas.Series
    _tensorized_1: pandas.Series=None
    _tensorized_2: pandas.Series=None

Tensorized 1

@property
def tensorized_1(self) -> pandas.Series:
    """numeric version of question 1"""
    if self._tensorized_1 is None:
        self._tensorized_1 = self.question_1.apply(self.to_index)
    return self._tensorized_1

Tensorized 2

@property
def tensorized_2(self) -> pandas.Series:
    """Numeric version of question 2"""
    if self._tensorized_2 is None:
        self._tensorized_2 = self.question_2.apply(self.to_index)
    return self._tensorized_2

To Index

def to_index(self, words: list) -> list:
    """Convert list of words to list of integers"""
    return [self.vocabulary[word] for word in words]

The Data Loader

@attr.s(auto_attribs=True)
class DataLoader:
    """Loads and transforms the data

    Args:
     env: The path to the .env file with the raw-data path
     key: key in the environment with the path to the data
     train_validation_size: number of entries for the training/validation set
     training_fraction: what fraction of the training/validation set to use for training
    """
    env: str="posts/nlp/.env"
    key: str="QUORA_TRAIN"
    train_validation_size: int=300000
    training_fraction: float=0.8
    _data_path: Path=None
    _raw_data: pandas.DataFrame=None
    _training_data: pandas.DataFrame=None
    _testing_data: pandas.DataFrame=None
    _duplicates: pandas.DataFrame=None
    _tokenized_train: DataTokenizer=None
    _tokenized_test: DataTokenizer=None
    _vocabulary: dict=None
    _tensorized_train: DataTensorizer=None
    _tensorized_test: DataTensorizer=None
    _test_labels: pandas.Series=None    
    _data: namedtuple=None

Data Path

@property
def data_path(self) -> Path:
    """Where to find the data file"""
    if self._data_path is None:
        load_dotenv(self.env)
        self._data_path = Path(os.environ[self.key]).expanduser()
    return self._data_path

Data

@property
def raw_data(self) -> pandas.DataFrame:
    """The raw-data"""
    if self._raw_data is None:
        self._raw_data = pandas.read_csv(self.data_path)
        self._raw_data = self._raw_data[~self._raw_data.question1.isna()]
        self._raw_data = self._raw_data[~self._raw_data.question2.isna()]        
    return self._raw_data

Training Data

@property
def training_data(self) -> pandas.DataFrame:
    """The training/validation part of the data"""
    if self._training_data is None:
        self._training_data = self.raw_data.iloc[:self.train_validation_size]
    return self._training_data

Testing Data

@property
def testing_data(self) -> pandas.DataFrame:
    """The testing portion of the raw data"""
    if self._testing_data is None:
        self._testing_data = self.raw_data.iloc[self.train_validation_size:]
    return self._testing_data

Duplicates

@property
def duplicates(self) -> pandas.DataFrame:
    """training-validation data that has duplicate questions"""
    if self._duplicates is None:
        self._duplicates = self.training_data[self.training_data.is_duplicate==1]
    return self._duplicates

Train Tokenizer

@property
def tokenized_train(self) -> DataTokenizer:
    """training tokenized    
    """
    if self._tokenized_train is None:
        self._tokenized_train = DataTokenizer(self.duplicates)
    return self._tokenized_train

Test Tokenizer

@property
def tokenized_test(self) -> DataTokenizer:
    """Test Tokenizer"""
    if self._tokenized_test is None:
        self._tokenized_test = DataTokenizer(
            self.testing_data)
    return self._tokenized_test

The Vocabulary

@property
def vocabulary(self) -> dict:
    """The token:index map"""
    if self._vocabulary is None:
        self._vocabulary = defaultdict(lambda: TOKENS.unknown)
        self._vocabulary[TOKENS.padding_token] = TOKENS.padding
        combined = (self.tokenized_train.question_1
                    + self.tokenized_train.question_2)
        for index, tokens in combined.iteritems():
            tokens = (token for token in set(tokens)
                      if token not in self._vocabulary)
            for token in tokens:
                self._vocabulary[token] = len(self._vocabulary) + 1
    return self._vocabulary            

Tensorized Train

@property
def tensorized_train(self) -> DataTensorizer:
    """Tensorizer for the training data"""
    if self._tensorized_train is None:
        self._tensorized_train = DataTensorizer(
            vocabulary=self.vocabulary,
            question_1 = self.tokenized_train.question_1,
            question_2 = self.tokenized_train.question_2,
        )
    return self._tensorized_train

Tensorized Test

@property
def tensorized_test(self) -> DataTensorizer:
    """Tensorizer for the testing data"""
    if self._tensorized_test is None:
        self._tensorized_test = DataTensorizer(
            vocabulary = self.vocabulary,
            question_1 = self.tokenized_test.question_1,
            question_2 = self.tokenized_test.question_2,
        )
    return self._tensorized_test

Test Labels

@property
def test_labels(self) -> pandas.Series:
    """The labels for the test data

    0 : not duplicate questions
    1 : is duplicate
    """
    if self._test_labels is None:
        self._test_labels = self.testing_data.is_duplicate
    return self._test_labels

The Final Data

@property
def data(self) -> namedtuple:
    """The final tensorized data"""
    if self._data is None:
        cut_off = int(len(self.duplicates) * self.training_fraction)
        self._data = Data(
            train=Question(
                question_one=self.tensorized_train.tensorized_1[:cut_off].to_numpy(),
                question_two=self.tensorized_train.tensorized_2[:cut_off].to_numpy()),
            validate=Question(
                question_one=self.tensorized_train.tensorized_1[cut_off:].to_numpy(),
                question_two=self.tensorized_train.tensorized_2[cut_off:].to_numpy()),
            test=Question(
                question_one=self.tensorized_test.tensorized_1.to_numpy(),
                question_two=self.tensorized_test.tensorized_2.to_numpy()),
            y_test=self.test_labels.to_numpy(),
        )
    return self._data

Test It Out

from neurotic.nlp.siamese_networks import DataLoader

loader = DataLoader()

data = loader.data
print(f"Number of duplicate questions: {len(loader.duplicates):,}")
print(f"The length of the training set is:  {len(data.train.question_one):,}")
print(f"The length of the validation set is: {len(data.validate.question_one):,}")
Number of duplicate questions: 111,474
The length of the training set is:  89,179
The length of the validation set is: 22,295

The counts are slightly different from the earlier ones because the DataLoader drops the rows with missing questions from the whole dataset before splitting it, so the first 300,000 rows aren't quite the same as before.
print('first question in the train set:\n')
print(loader.duplicates.question1.iloc[0])
print('encoded version:')
print(data.train.question_one[0],'\n')
expect(data.train.question_one[0]).to(contain_exactly(*Q1_train.iloc[0]))
first question in the train set:

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
encoded version:
[7, 6, 17, 26, 22, 12, 15, 14, 2, 24, 16, 19, 31, 8, 9, 21, 25, 3, 23, 29] 

assert len(loader.vocabulary) == len(vocabulary)
assert not set(vocabulary) - set(loader.vocabulary)
print(f"{len(loader.vocabulary):,}")
77,068