Siamese Networks: The Data Generator
Beginning
Most of the time in Natural Language Processing, and in AI in general, we train on batches of data. If you were to use stochastic gradient descent with one example at a time, it would take forever to build a model. In this example, we show you how you can build a data generator that takes in \(Q1\) and \(Q2\) and returns a batch of size batch_size in the following format \(([q1_1, q1_2, q1_3, \ldots], [q2_1, q2_2, q2_3, \ldots])\). The tuple consists of two arrays, and each array has batch_size questions. Again, \(q1_i\) and \(q2_i\) are duplicates of each other, but they are not duplicates of any other elements in the batch.
The iterator that we're going to create returns a pair of arrays of questions.
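As a toy illustration (the token IDs here are invented, not from the real vocabulary), a batch of size two has this shape before padding:

# purely illustrative token IDs
batch = (
    [[32, 41, 7], [15, 6, 20]],     # q1_1, q1_2
    [[32, 41, 9], [15, 6, 20, 4]],  # q2_1, q2_2: duplicates of the questions above
)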
We'll implement the data generator below. Here are some things we will need.
- Use a while True loop so the generator keeps producing batches indefinitely.
- If idx >= len_Q1, set idx to \(0\).
- The generator should return shuffled batches of data. To achieve this without modifying the actual question lists, create a list containing the indexes of the questions. This list can be shuffled and used to get random batches every time the index is reset.
- Append elements of \(Q1\) and \(Q2\) to input1 and input2 respectively.
- If len(input1) == batch_size, determine max_len as the length of the longest question in input1 and input2, then ceil max_len to a power of \(2\) (for computation purposes) using the following command: max_len = 2**int(np.ceil(np.log2(max_len))) (there's a small sketch of this step after the list).
- Pad every question with vocab['<PAD>'] until it reaches length max_len.
- Use yield to return input1, input2.
- Don't forget to reset input1, input2 to empty lists at the end (the data generator resumes from where it last left off).
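To make the ceiling-and-pad step concrete, here is a minimal standalone sketch with made-up token lists (the pad value of 1 matches the default used below). The longest toy question has five tokens, so max_len rounds up to eight and every question gets padded to that length:

# toy data; pad=1 matches the default below
import numpy as np

questions = [[34, 37, 13], [34, 95, 573, 1444, 2343]]
pad = 1

max_len = max(len(question) for question in questions)
max_len = 2**int(np.ceil(np.log2(max_len)))  # 5 rounds up to 8
padded = [question + (max_len - len(question)) * [pad] for question in questions]
print(padded)
# [[34, 37, 13, 1, 1, 1, 1, 1], [34, 95, 573, 1444, 2343, 1, 1, 1]]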
Imports
# python
import random
# pypi
import numpy
# this project
from neurotic.nlp.siamese_networks import DataLoader, TOKENS
Set Up
Our Data
loader = DataLoader()
data = loader.data
The Idiotic Names
np = numpy
rnd = random
Middle
def data_generator(Q1:list, Q2:list, batch_size: int,
pad: int=1, shuffle: bool=True):
"""Generator function that yields batches of data
Args:
Q1 (list): List of transformed (to tensor) questions.
Q2 (list): List of transformed (to tensor) questions.
batch_size (int): Number of elements per batch.
pad (int, optional): Pad character from the vocab. Defaults to 1.
        shuffle (bool, optional): If the batches should be randomized or not. Defaults to True.
Yields:
tuple: Of the form (input1, input2) with types (numpy.ndarray, numpy.ndarray)
    NOTE: input1: inputs to your model [q1a, q2a, q3a, ...] i.e. (q1a, q1b) are duplicates
          input2: targets to your model [q1b, q2b, q3b, ...] i.e. (q1a, q2i) for i != a are not duplicates
"""
input1 = []
input2 = []
idx = 0
len_q = len(Q1)
question_indexes = list(range(len_q))
if shuffle:
rnd.shuffle(question_indexes)
while True:
if idx >= len_q:
# if idx is greater than or equal to len_q, set idx accordingly
# (Hint: look at the instructions above)
idx = 0
# shuffle to get random batches if shuffle is set to True
if shuffle:
rnd.shuffle(question_indexes)
# get questions at the `question_indexes[idx]` position in Q1 and Q2
q1 = Q1[question_indexes[idx]]
q2 = Q2[question_indexes[idx]]
# increment idx by 1
idx += 1
# append q1
input1.append(q1)
# append q2
input2.append(q2)
if len(input1) == batch_size:
            # determine max_len as the longest question in input1 & input2
            # Hint: use the `max` function.
            # take the max within input1 and input2, then the max of the two
            max_len = max(max(len(question) for question in input1),
                          max(len(question) for question in input2))
            print(max_len)  # debug print: longest question in the batch
            # pad to power-of-2 (Hint: look at the instructions above)
            max_len = 2**int(np.ceil(np.log2(max_len)))
            print(max_len)  # debug print: max_len rounded up to a power of 2
b1 = []
b2 = []
for q1, q2 in zip(input1, input2):
# add [pad] to q1 until it reaches max_len
q1 = q1 + ((max_len - len(q1)) * [pad])
# add [pad] to q2 until it reaches max_len
q2 = q2 + ((max_len - len(q2)) * [pad])
# append q1
b1.append(q1)
# append q2
b2.append(q2)
# use b1 and b2
yield np.array(b1), np.array(b2)
# reset the batches
            input1, input2 = [], []
Try It Out
rnd.seed(34)
batch_size = 2
generator = data_generator(data.train.question_one, data.train.question_two, batch_size)
result_1, result_2 = next(generator)
print(f"First questions : \n{result_1}\n")
print(f"Second questions : \n{result_2}")
11
16
First questions : 
[[  34   37   13   50  536 1303 6428   25  924  157   28    1    1    1    1    1]
 [  34   95  573 1444 2343   28    1    1    1    1    1    1    1    1    1    1]]

Second questions : 
[[  34   37   13  575 1303 6428   25  924  157   28    1    1    1    1    1    1]
 [   9  151   25  573 5642   28    1    1    1    1    1    1    1    1    1    1]]
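The 11 and 16 at the top come from the two debug prints: the longest question in this batch has eleven tokens, which gets rounded up to sixteen, and the trailing 1s are the padding token.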
Bundling It Up
Imports
# python
from collections import namedtuple
import random
# pypi
import attr
import numpy
# this project
from neurotic.nlp.siamese_networks import DataLoader, TOKENS
The Data Generator
@attr.s(auto_attribs=True)
class DataGenerator:
"""Batch Generator for Quora question dataset
Args:
question_one: tensorized question 1
question_two: tensorized question 2
batch_size: size of generated batches
padding: token to use to pad the lists
shuffle: whether to shuffle the questions around
"""
question_one: numpy.ndarray
question_two: numpy.ndarray
batch_size: int
padding: int=TOKENS.padding
shuffle: bool=True
_batch: iter=None
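Since auto_attribs is turned on, attrs generates the __init__ from these class annotations (including the defaults), so constructing an instance only requires the two question lists and a batch size.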
The Generator Definition
def data_generator(self):
"""Generator function that yields batches of data
Yields:
tuple: (batch_question_1, batch_question_2)
"""
unpadded_1 = []
unpadded_2 = []
index = 0
number_of_questions = len(self.question_one)
question_indexes = list(range(number_of_questions))
if self.shuffle:
random.shuffle(question_indexes)
while True:
if index >= number_of_questions:
index = 0
if self.shuffle:
random.shuffle(question_indexes)
unpadded_1.append(self.question_one[question_indexes[index]])
unpadded_2.append(self.question_two[question_indexes[index]])
index += 1
if len(unpadded_1) == self.batch_size:
max_len = max(max(len(question) for question in unpadded_1),
max(len(question) for question in unpadded_2))
max_len = 2**int(numpy.ceil(numpy.log2(max_len)))
padded_1 = []
padded_2 = []
for question_1, question_2 in zip(unpadded_1, unpadded_2):
padded_1.append(question_1 + ((max_len - len(question_1)) * [self.padding]))
padded_2.append(question_2 + ((max_len - len(question_2)) * [self.padding]))
yield numpy.array(padded_1), numpy.array(padded_2)
unpadded_1, unpadded_2 = [], []
return
The Generator
@property
def batch(self):
"""The generator instance"""
if self._batch is None:
self._batch = self.data_generator()
return self._batch
The Iter Method
def __iter__(self):
return self
The Next Method
def __next__(self):
return next(self.batch)
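Defining __iter__ and __next__ makes the DataGenerator itself an iterator, so an instance can be passed to next or used in a for loop, while the batch property makes sure the underlying generator is only created on first use.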
Check It Out
from neurotic.nlp.siamese_networks import DataGenerator, DataLoader
loader = DataLoader()
data = loader.data
generator = DataGenerator(data.train.question_one, data.train.question_two, batch_size=2)
random.seed(34)
result_1, result_2 = next(generator)
print(f"First questions : \n{result_1}\n")
print(f"Second questions : \n{result_2}")
First questions : 
[[  34   37   13   50  536 1303 6428   25  924  157   28    1    1    1    1    1]
 [  34   95  573 1444 2343   28    1    1    1    1    1    1    1    1    1    1]]

Second questions : 
[[  34   37   13  575 1303 6428   25  924  157   28    1    1    1    1    1    1]
 [   9  151   25  573 5642   28    1    1    1    1    1    1    1    1    1    1]]
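Apart from the missing debug prints, these batches match the output of the standalone data_generator function, so the bundled class behaves the same way.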