Data Generators

Data generators

In Python, a generator is a function that behaves like an iterator. It will return the next item. In many AI applications, it is advantageous to have a data generator to handle loading and transforming data for different applications.

In the following example, we use a set of samples a, to derive a new set of samples, with more elements than the original set.

Note: Pay attention to the use of list lines_index and variable index to traverse the original list.

Imports

# python
from itertools import cycle

import random

# pypi
from expects import be_true, expect
import numpy

Examples

An Example of a Circular List

This is sort of a fake generator that uses indices to make it look like it's infinite.

a = [1, 2, 3, 4]
a_size = len(a)
end = 10
index = 0                      # similar to index in data_generator below
for i in range(10):        # `b` is longer than `a` forcing a wrap   
    print(a[index], end=",")
    index = (index + 1) % a_size    
1,2,3,4,1,2,3,4,1,2,

There's a python built-in that's equivalent to this called cycle.

index = 1
for item in cycle(a):
    print(item, end=",")
    if index == end:
        break
    index += 1    
1,2,3,4,1,2,3,4,1,2,

And if you wanted to make your own generator version you could use the yield keyword.

def infinite(a: list):
    """Generates elements infinitely

    Args:
     a: list

    Yields:
     elements of a
    """
    index = 0
    end = len(a)
    while True:
        yield a[index]
        index = (index + 1) % end
    return

a_infinite = infinite(a)
for index, item in enumerate(a_infinite):
    if index == end:
        break
    print(item, end=",")
1,2,3,4,1,2,3,4,1,2,

Shuffling the data order

In the next example, we will do the same as before, but shuffling the order of the elements in the output list. Note that here, our strategy of traversing using lines_index and index becomes very important, because we can simulate a shuffle in the input data, without doing that in reality.

a = tuple((1, 2, 3, 4))
a_size = len(a)
data_indices = list(range(a_size))
print(f"Original order of indices: {data_indices}")
Original order of indices: [0, 1, 2, 3]

If we shuffle the index_list we can change the order of our circular list without modifying the order or our original data.

random.shuffle(data_indices) # Shuffle the order
print(f"Shuffled order of indices: {data_indices}")
Shuffled order of indices: [3, 0, 1, 2]

Now we create a list of random values from a that is larger than a.

b = [a[index] for index in data_indices]
b_size = 10

print(f"New value order for first batch: {b}")
batch_counter = 1
data_index = 0
for b_index in range(len(b), b_size):
    if data_index == 0:
        batch_counter += 1
        random.shuffle(data_indices)
        print(f"\nShuffled Indexes for Batch No. {batch_counter} :{data_indices}")
        print(f"Values for Batch No.{batch_counter} :{[a[index] for index in data_indices]}")

    b.append(a[data_indices[data_index]])
    data_index = (data_index + 1) % a_size

print(f"\nFinal value of b: {b} with {len(b)} items")
New value order for first batch: [1, 3, 4, 2]

Shuffled Indexes for Batch No. 2 :[1, 3, 2, 0]
Values for Batch No.2 :[2, 4, 3, 1]

Shuffled Indexes for Batch No. 3 :[0, 3, 2, 1]
Values for Batch No.3 :[1, 4, 3, 2]

Final value of b: [1, 3, 4, 2, 2, 4, 3, 1, 1, 4] with 10 items

Note: We call an epoch each time that an algorithm passes over all the training examples. Shuffling the examples for each epoch is known to reduce variance, making the models more general and overfit less.

Using sample. instead.

data_indices = random.sample(range(a_size), k=a_size)
b = [a[index] for index in data_indices]
b_size = 10

print(f"New value order for first batch: {b}")
batch_counter = 1
data_index = 0
for b_index in range(len(b), b_size):
    if data_index == 0:
        batch_counter += 1
        data_indices = random.sample(data_indices, k=a_size)
        print(f"\nShuffled Indexes for Batch No. {batch_counter} :{data_indices}")
        print(f"Values for Batch No.{batch_counter} :{[a[index] for index in data_indices]}")

    b.append(a[data_indices[data_index]])
    data_index = (data_index + 1) % a_size

print(f"\nFinal value of b: {b} with {len(b)} items")
New value order for first batch: [1, 4, 3, 2]

Shuffled Indexes for Batch No. 2 :[3, 0, 1, 2]
Values for Batch No.2 :[4, 1, 2, 3]

Shuffled Indexes for Batch No. 3 :[2, 0, 1, 3]
Values for Batch No.3 :[3, 1, 2, 4]

Final value of b: [1, 4, 3, 2, 4, 1, 2, 3, 3, 1] with 10 items

Data Generator Function

This will be a data generator function that takes in batch_size, x, y shuffle where x could be a large list of samples, and y is a list of the tags associated with those samples. Return a subset of those inputs in a tuple of two arrays (X,Y). Each is an array of dimension (batch_size). If shuffle=True, the data will be traversed in a random form.

Which runs continuously in the fashion of generators, pausing when yielding the next values. We will generate a batch_size output on each pass of this loop.

It has an inner loop that stores the data samples in temporary lists (X, Y) which will be included in the next batch.

There are three slightly out-of-the-ordinary features to this function.

  1. The first is the use of a list of a predefined size to store the data for each batch. Using a predefined size list reduces the computation time if the elements in the array are of a fixed size, like numbers. If the elements are of different sizes, it is better to use an empty array and append one element at a time during the loop.
  2. The second is tracking the current location in the incoming lists of samples. Generators variables hold their values between invocations, so we create an index variable, initialize to zero, and increment by one for each sample included in a batch. However, we do not use the index to access the positions of the list of sentences directly. Instead, we use it to select one index from a list of indexes. In this way, we can change the order in which we traverse our original list, keeping untouched our original list.
  3. The third also relates to wrapping. Because batch_size and the length of the input lists are not aligned, gathering a batch_size group of inputs may involve wrapping back to the beginning of the input loop. In our approach, it is just enough to reset the index to 0. We can re-shuffle the list of indexes to produce different batches each time.
def data_generator(batch_size: int, data_x: list, data_y: list, shuffle: bool=True):
    """Infinite batch generator

      Args: 
       batch_size: the size to make batches
       data_x: list containing samples
       data_y: list containing labels
       shuffle: Shuffle the data order

      Yields:
       a tuple containing 2 elements:
       X - list of dim (batch_size) of samples
       Y - list of dim (batch_size) of labels
    """
    amount_of_data = len(data_x)
    assert amount_of_data == len(data_y)

    def re_shuffle(x):
        k = len(x)
        return random.sample(range(k), k=k)

    shuffler = re_shuffle if shuffle else lambda x: list(range(len(x)))
    source_indices = shuffler(data_x)

    source_location = 0
    while True:
        X = list(range(batch_size))
        Y = list(range(batch_size))

        for batch_location in range(batch_size):                            
            X[batch_location] = data_x[source_indices[source_location]]
            Y[batch_location] = data_y[source_indices[source_location]]
            source_location = (source_location + 1) % amount_of_data
            source_indices = (shuffler(data_x) if source_location == 0
                              else source_indices)            
        yield((X, Y))
    return
def test_data_generator() -> None:
    """Tests the un-shuffled version of the generator

    Raises:
     AssertionError: some value didn't match.
    """
    x = [1, 2, 3, 4]
    y = [xi ** 2 for xi in x]

    generator = data_generator(3, x, y, shuffle=False)
    for expected in (([1, 2, 3], [1, 4, 9]),
                     ([4, 1, 2], [16, 1, 4]),
                     ([3, 4, 1], [9, 16, 1]),
                     ([2, 3, 4], [4, 9, 16])):
        expect(numpy.allclose(next(generator), expected)).to(be_true)
    return
test_data_generator()