Deep N-Grams: Batch Generation

Generating Batches of Data

In Natural Language Processing, and in machine learning generally, models are usually trained on batches of data rather than one example at a time. Here, you will build a data generator that takes in lines of text and returns them in batches (each line is a sentence).

  • The generator converts text lines (sentences) into numpy arrays of integers padded by zeros so that all arrays have the same length, max_length. Lines that are too long to fit within max_length are skipped.

This generator returns the data in a format that you can use directly in your model when computing the forward pass. Each batch is a tuple of three parts: inputs, targets, and mask. The inputs and targets are identical; the targets are what your predictions will be evaluated against. The mask is 1 for non-padding tokens and 0 for padding.
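To make the batch format concrete, here is a minimal sketch using plain Python lists. The integer-encoded lines and the max_length of 6 are made up for illustration; the real generator uses the DataLoader's encoding and returns trax arrays.

```python
# Hypothetical integer-encoded lines of different lengths (not real data).
encoded_lines = [[49, 50, 51, 1], [52, 53, 1]]
max_length = 6  # assumed value, chosen only for this sketch

# Pad each line with zeros so every row has length max_length.
padded = [line + [0] * (max_length - len(line)) for line in encoded_lines]

# The mask is 1 for real tokens and 0 for padding.
mask = [[int(token != 0) for token in line] for line in padded]

# Inputs and targets are the same padded batch.
batch = (padded, padded, mask)
print(batch[0])
print(batch[2])
```

The mask lets a loss function ignore the padded positions, which is why it travels with the batch.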

Imports

# python
from itertools import cycle
import random

# from pypi
from expects import be_true, expect
import trax.fastmath.numpy as numpy

# this project
from neurotic.nlp.deep_rnn.data_loader import DataLoader

Set Up

The DataLoader

data_loader = DataLoader()

Middle

The Data Generator

  • While True loop: this will yield one batch at a time.
  • if index >= num_lines, set index to 0.
  • The generator should return shuffled batches of data. To achieve this without modifying the actual lines, a list containing the indexes of data_lines is created. This list can be shuffled and used to get random batches every time the index is reset.
  • if len(line) < max_length append line to cur_batch.
    • Note that a line that has length equal to max_length should not be appended to the batch.
    • This is because when converting the characters into a tensor of integers, an additional end of sentence token id will be added.
    • So if max_length is 5, and a line has 4 characters, the tensor representing those 4 characters plus the end of sentence character will be of length 5, which is the max length.
  • if len(cur_batch) == batch_size, go over every line, convert it to a tensor of integers, pad it, and store it.
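The index-shuffling step in the list above can be sketched on its own. The sample lines and the seed are assumptions made only so the example is repeatable.

```python
import random

data_lines = ["first line", "second line", "third line"]  # hypothetical data

# Shuffle a list of positions rather than the lines themselves.
lines_index = list(range(len(data_lines)))
random.seed(0)  # seeded only so the sketch is repeatable
random.shuffle(lines_index)

# Visit the lines in shuffled order; data_lines keeps its original order.
shuffled_view = [data_lines[position] for position in lines_index]
print(lines_index)
print(shuffled_view)
```

Because only the index list is shuffled, the caller's data is never reordered, and reshuffling at the start of each pass is cheap.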

Remember that when calling numpy here you are really calling trax.fastmath.numpy, which is trax's version of numpy that is compatible with JAX. As a result, where you used to encounter the type numpy.ndarray you will now find the type jax.interpreters.xla.DeviceArray.

Hints:

  • Use the line_to_tensor function above inside a list comprehension in order to pad lines with zeros.
  • The length of the tensor is always 1 + the length of the original line of characters. Keep this in mind when setting the padding of zeros.
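The length rule and the padding arithmetic from the hints can be checked with a small stand-in for line_to_tensor. The function to_tensor_sketch and the end-of-sentence id of 1 are assumptions for illustration, not the DataLoader's actual implementation.

```python
EOS_ID = 1  # assumed end-of-sentence id


def to_tensor_sketch(line: str, eos_id: int = EOS_ID) -> list:
    """Hypothetical stand-in for line_to_tensor: character codes plus an EOS id."""
    return [ord(character) for character in line] + [eos_id]


max_length = 5
results = {}
for line in ("abcd", "abcde", "ab"):
    if len(line) < max_length:
        # The tensor is one longer than the line, so it still fits.
        tensor = to_tensor_sketch(line)
        results[line] = tensor + [0] * (max_length - len(tensor))
    else:
        # A line of length max_length would overflow once the EOS id is added.
        results[line] = None

print(results)
```

Here "abcd" fills all five positions exactly, "ab" gets two zeros of padding, and "abcde" is skipped.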

The generator expects the conversion function to be passed in as line_to_tensor, so the DataLoader's to_tensor method is aliased here to match the original definition.

line_to_tensor = data_loader.to_tensor

Implementing the Generator

def data_generator(batch_size: int, max_length: int, data_lines: list,
                   line_to_tensor=line_to_tensor, shuffle: bool=True):
    """Generator function that yields batches of data

    Args:
       batch_size (int): number of examples (in this case, sentences) per batch.
       max_length (int): maximum length of the output tensor.
       NOTE: max_length includes the end-of-sentence character that will be added
               to the tensor.  
               Keep in mind that the length of the tensor is always 1 + the length
               of the original line of characters.
       data_lines (list): list of the sentences to group into batches.
       line_to_tensor (function, optional): function that converts line to tensor. Defaults to line_to_tensor.
       shuffle (bool, optional): True if the generator should generate random batches of data. Defaults to True.

    Yields:
       tuple: two copies of the batch (jax.interpreters.xla.DeviceArray) and mask (jax.interpreters.xla.DeviceArray).
       NOTE: jax.interpreters.xla.DeviceArray is trax's version of numpy.ndarray
    """
    # initialize the index that points to the current position in the lines index array
    index = 0

    # initialize the list that will contain the current batch
    cur_batch = []

    # count the number of lines in data_lines
    num_lines = len(data_lines)

    # create an array with the indexes of data_lines that can be shuffled
    lines_index = [*range(num_lines)]

    # shuffle line indexes if shuffle is set to True
    if shuffle:
        random.shuffle(lines_index)

    while True:

        # if the index is greater than or equal to the number of lines in data_lines
        if index >= num_lines:
            # then reset the index to 0
            index = 0
            # shuffle line indexes if shuffle is set to True
            if shuffle:
                random.shuffle(lines_index)

        # get a line at the `lines_index[index]` position in data_lines
        line = data_lines[lines_index[index]]

        # if the length of the line is less than max_length
        if len(line) < max_length:
            # append the line to the current batch
            cur_batch.append(line)

        # increment the index by one
        index += 1

        # if the current batch is now equal to the desired batch size
        if len(cur_batch) == batch_size:

            batch = []
            mask = []

            # go through each line (li) in cur_batch
            for li in cur_batch:
                # convert the line (li) to a tensor of integers
                tensor = line_to_tensor(li)

                # Create a list of zeros to represent the padding
                # so that the tensor plus padding will have length `max_length`
                pad = [0] * (max_length - len(tensor))

                # combine the tensor plus pad
                tensor_pad = tensor + pad

                # append the padded tensor to the batch
                batch.append(tensor_pad)

                # A mask for  tensor_pad is 1 wherever tensor_pad is not
                # 0 and 0 wherever tensor_pad is 0, i.e. if tensor_pad is
                # [1, 2, 3, 0, 0, 0] then example_mask should be
                # [1, 1, 1, 0, 0, 0]
                # Hint: Use a list comprehension for this
                example_mask = [int(item != 0) for item in tensor_pad]
                mask.append(example_mask)

            # convert the batch and mask (data type list) to trax numpy arrays
            batch_np_arr = numpy.array(batch)
            mask_np_arr = numpy.array(mask)


            # Yield two copies of the batch and mask.
            yield batch_np_arr, batch_np_arr, mask_np_arr

            # reset the current batch to an empty list
            cur_batch = []

Try out the data generator.

tmp_lines = ['12345678901',
             '123456789',
             '234567890',
             '345678901']

Create a generator with a batch size of 2 and a maximum length of 10.

tmp_data_gen = data_generator(batch_size=2, 
                              max_length=10, 
                              data_lines=tmp_lines,
                              shuffle=False)

Get one batch.

tmp_batch = next(tmp_data_gen)

View the batch.

print(tmp_batch)

expected = (numpy.array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
                         [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]]),
            numpy.array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
                         [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]]),
            numpy.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))
for index, batch in enumerate(tmp_batch):
    expect(bool((batch==expected[index]).all())).to(be_true)
(DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
             [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32), DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
             [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32), DeviceArray([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
             [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32))

Now that you have your generator, you can call next on it and it will return tensors that correspond to your lines of Shakespeare. The first and second elements of each batch are identical. Now you can go ahead and start building your neural network.

Repeating Batch Generator

The way the iterator is currently defined, it will keep providing batches forever.

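Although it is not needed here, we want to show you the itertools.cycle function, which is really useful when you have a generator that eventually stops. Note that cycle saves a copy of every item it receives so that it can replay them, so wrapping a generator that never stops only makes memory use grow.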

Usually we want to cycle over the dataset multiple times during training (i.e. train for multiple epochs).

For small datasets we can use itertools.cycle to achieve this easily.

infinite_data_generator = cycle(
    data_generator(batch_size=2, max_length=10, data_lines=tmp_lines))
ten_lines = [next(infinite_data_generator) for _ in range(10)]
print(len(ten_lines))
10
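For contrast, here is a sketch of cycle wrapped around a generator that does stop, which is the case it is meant for. The finite_batches generator and its string "batches" are hypothetical placeholders.

```python
from itertools import cycle, islice


def finite_batches():
    """A hypothetical generator that stops after one pass (one epoch)."""
    for batch in ("batch 0", "batch 1", "batch 2"):
        yield batch


# cycle replays the three saved items once the generator is exhausted.
repeating = cycle(finite_batches())
seven = list(islice(repeating, 7))
print(seven)
```

Since cycle replays saved copies, later passes repeat the first pass in the same order; any shuffling done inside the wrapped generator is not redone.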

Bundle It Up

As always, since this is going to be needed further down the road, I'll bundle it up.

Imports

# python
import random

# pypi
import attr
import trax.fastmath.numpy as numpy

# this project
from neurotic.nlp.deep_rnn.data_loader import DataLoader

Data Generator

@attr.s(auto_attribs=True)
class DataGenerator:
    """Generates batches

    Args:
     data: lines of data
     data_loader: object with a to_tensor method
     batch_size: size of the batches
     max_length: the maximum length for a line (longer lines will be ignored)
     shuffle: whether to shuffle the data
    """
    data: list
    data_loader: DataLoader
    batch_size: int
    max_length: int
    shuffle: bool=True
    _line_count: int=None
    _line_indices: list=None
    _generator: object=None

Line Count

@property
def line_count(self) -> int:
    """Number of lines in the data"""
    if self._line_count is None:
        self._line_count = len(self.data)
    return self._line_count

Line Indices

@property
def line_indices(self) -> list:
    """Indices of the lines in the data"""
    if self._line_indices is None:
        self._line_indices = list(range(self.line_count))
    return self._line_indices

The Iterator Method

def __iter__(self):
    """Returns this object, since it is its own iterator"""
    return self

The Batch Generator

def data_generator(self):
    """Generator method that yields batches of data

    Yields:
     (batch, batch, mask)
    """
    index = 0
    current_batch = []
    if self.shuffle:
        random.shuffle(self.line_indices)

    while True:
        if index >= self.line_count:
            index = 0
            if self.shuffle:
                random.shuffle(self._line_indices)

        line = self.data[self.line_indices[index]]
        if len(line) < self.max_length:
            current_batch.append(line)
        index += 1

        if len(current_batch) == self.batch_size:
            batch = []
            mask = []
            for line in current_batch:
                tensor = self.data_loader.to_tensor(line)
                tensor += [0] * (self.max_length - len(tensor))
                batch.append(tensor)
                mask.append([int(item != 0) for item in tensor])

            batch = numpy.array(batch)
            yield batch, batch, numpy.array(mask)
            current_batch = []

The Generator

@property
def generator(self):
    """Infinite generator of batches"""
    if self._generator is None:
        self._generator = self.data_generator()
    return self._generator

The Next Method

def __next__(self):
    """make this an iterator"""
    return next(self.generator)
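The pattern used by the class (a lazily created generator exposed through __iter__ and __next__) can be seen in isolation in the sketch below. The Counter class is hypothetical and exists only to show the iterator protocol.

```python
from itertools import islice


class Counter:
    """Hypothetical iterator that wraps an infinite generator."""

    def __init__(self):
        self._generator = None

    def _count(self):
        """Generator method that yields 0, 1, 2, ... forever."""
        value = 0
        while True:
            yield value
            value += 1

    @property
    def generator(self):
        """Creates the generator on first use and then reuses it."""
        if self._generator is None:
            self._generator = self._count()
        return self._generator

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.generator)


counter = Counter()
first = next(counter)            # works because of __next__
rest = list(islice(counter, 3))  # works because of __iter__
print(first, rest)
```

Because the generator never stops, a plain for loop over the object would run forever; islice (or a break) is one way to take a fixed number of items.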

Try It Out

from neurotic.nlp.deep_rnn import DataGenerator, DataLoader

loader = DataLoader()
test_lines = ['12345678901',
              '123456789',
              '234567890',
              '345678901']

generator = DataGenerator(data=test_lines,
                          data_loader=loader,
                          batch_size=2,
                          max_length=10,
                          shuffle=False)

actual = next(generator)

expected = (numpy.array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
                         [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]]),
            numpy.array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
                         [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]]),
            numpy.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))
for index, batch in enumerate(actual):
    try:
        expect(bool((batch==expected[index]).all())).to(be_true)
    except AssertionError:
        print(batch)
        print(expected[index])
        break