Data Generators
Data generators
In Python, a generator is a function that behaves like an iterator: each time it is asked for a value, it yields the next item and then pauses until the next request. In many AI applications, it is advantageous to have a data generator to handle loading and transforming data for different applications.
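As a minimal sketch of that behavior (the `countdown` function is a hypothetical name used only for illustration), calling a generator function returns an iterator, and each call to `next` resumes the body until it reaches the next `yield`:

```python
def countdown(start: int):
    """Yield start, start - 1, ..., 1 (hypothetical example)."""
    current = start
    while current > 0:
        yield current  # execution pauses here until the next item is requested
        current -= 1


counter = countdown(3)
first = next(counter)      # runs the body up to the first yield
remaining = list(counter)  # consumes whatever is left
print(first, remaining)
```

Nothing in the body runs until the first item is requested, which is what lets a generator stand in for a data set that is too large (or endless) to build up front.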
In the following example, we use a set of samples a to derive a new set of samples with more elements than the original set.
Note: Pay attention to the use of a list of indices (data_indices in the shuffling examples) and a position variable (index) to traverse the original list.
Imports
# python
from itertools import cycle
import random
# pypi
from expects import be_true, expect
import numpy
Examples
An Example of a Circular List
This is sort of a fake generator that uses indices to make it look like it's infinite.
a = [1, 2, 3, 4]
a_size = len(a)
end = 10
index = 0 # similar to source_location in data_generator below
for i in range(end): # `end` is larger than `a_size`, forcing a wrap
    print(a[index], end=",")
    index = (index + 1) % a_size
1,2,3,4,1,2,3,4,1,2,
The itertools module in Python's standard library has an equivalent to this called cycle.
index = 1
for item in cycle(a):
    print(item, end=",")
    if index == end:
        break
    index += 1
1,2,3,4,1,2,3,4,1,2,
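If the manual counter is not needed, the same standard-library module also provides islice, which can cut the endless cycle off after a fixed number of items. This is only a sketch of an alternative; the name first_ten is illustrative.

```python
from itertools import cycle, islice

a = [1, 2, 3, 4]
end = 10

# islice stops the otherwise endless cycle after `end` items
first_ten = list(islice(cycle(a), end))
print(first_ten)
```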
And if you wanted to make your own generator version you could use the yield keyword.
def infinite(a: list):
    """Generates elements infinitely

    Args:
     a: list
    Yields:
     elements of a
    """
    index = 0
    end = len(a)
    while True:
        yield a[index]
        index = (index + 1) % end
    return

a_infinite = infinite(a)
for index, item in enumerate(a_infinite):
    if index == end:
        break
    print(item, end=",")
1,2,3,4,1,2,3,4,1,2,
Shuffling the data order
In the next example, we do the same as before, but shuffle the order of the elements in the output list. Here the strategy of traversing the data through a list of indices (data_indices) and a position variable (data_index) becomes very important, because we can simulate a shuffle of the input data without actually reordering it.
a = tuple((1, 2, 3, 4))
a_size = len(a)
data_indices = list(range(a_size))
print(f"Original order of indices: {data_indices}")
Original order of indices: [0, 1, 2, 3]
If we shuffle data_indices we can change the order of our circular list without modifying the order of our original data.
random.shuffle(data_indices) # Shuffle the order
print(f"Shuffled order of indices: {data_indices}")
Shuffled order of indices: [3, 0, 1, 2]
Now we build a list b from the values of a, taken in shuffled order, that is longer than a.
b = [a[index] for index in data_indices]
b_size = 10
print(f"New value order for first batch: {b}")
batch_counter = 1
data_index = 0
for b_index in range(len(b), b_size):
    if data_index == 0:
        batch_counter += 1
        random.shuffle(data_indices)
        print(f"\nShuffled Indexes for Batch No. {batch_counter} :{data_indices}")
        print(f"Values for Batch No.{batch_counter} :{[a[index] for index in data_indices]}")
    b.append(a[data_indices[data_index]])
    data_index = (data_index + 1) % a_size
print(f"\nFinal value of b: {b} with {len(b)} items")
New value order for first batch: [1, 3, 4, 2]

Shuffled Indexes for Batch No. 2 :[1, 3, 2, 0]
Values for Batch No.2 :[2, 4, 3, 1]

Shuffled Indexes for Batch No. 3 :[0, 3, 2, 1]
Values for Batch No.3 :[1, 4, 3, 2]

Final value of b: [1, 3, 4, 2, 2, 4, 3, 1, 1, 4] with 10 items
Note: An epoch is one complete pass of an algorithm over all the training examples. Shuffling the examples for each epoch is known to reduce variance, which helps models generalize better and overfit less.
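To make the per-epoch shuffle concrete, here is a minimal sketch. The helper shuffled_epochs and its seed parameter are hypothetical, introduced only for illustration; the seed is there just so the sketch is repeatable.

```python
import random


def shuffled_epochs(data: list, epochs: int, seed: int = 0):
    """Yield one freshly shuffled copy of `data` per epoch (hypothetical helper)."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    for _ in range(epochs):
        order = list(data)  # copy, so the caller's list keeps its order
        rng.shuffle(order)  # shuffle the copy in place
        yield order


samples = [1, 2, 3, 4]
orders = list(shuffled_epochs(samples, epochs=3))
for epoch, order in enumerate(orders, start=1):
    print(f"Epoch {epoch}: {order}")
```

Every epoch still visits each sample exactly once; only the order changes from one pass to the next.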
Using random.sample instead. Unlike random.shuffle, which reorders a list in place, random.sample returns a new list.
data_indices = random.sample(range(a_size), k=a_size)
b = [a[index] for index in data_indices]
b_size = 10
print(f"New value order for first batch: {b}")
batch_counter = 1
data_index = 0
for b_index in range(len(b), b_size):
    if data_index == 0:
        batch_counter += 1
        data_indices = random.sample(data_indices, k=a_size)
        print(f"\nShuffled Indexes for Batch No. {batch_counter} :{data_indices}")
        print(f"Values for Batch No.{batch_counter} :{[a[index] for index in data_indices]}")
    b.append(a[data_indices[data_index]])
    data_index = (data_index + 1) % a_size
print(f"\nFinal value of b: {b} with {len(b)} items")
New value order for first batch: [1, 4, 3, 2]

Shuffled Indexes for Batch No. 2 :[3, 0, 1, 2]
Values for Batch No.2 :[4, 1, 2, 3]

Shuffled Indexes for Batch No. 3 :[2, 0, 1, 3]
Values for Batch No.3 :[3, 1, 2, 4]

Final value of b: [1, 4, 3, 2, 4, 1, 2, 3, 3, 1] with 10 items
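The practical difference between the two approaches is what happens to the list you pass in. The short sketch below (variable names are illustrative) shows that random.sample leaves its input alone and returns a new list, while random.shuffle reorders the list it is given and returns None.

```python
import random

indices = [0, 1, 2, 3]

# random.sample returns a new list and leaves its input alone
sampled = random.sample(indices, k=len(indices))

# random.shuffle reorders the list in place and returns None
shuffled = list(indices)  # copy so `indices` keeps its order
result = random.shuffle(shuffled)

print(indices, sampled, shuffled, result)
```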
Data Generator Function
This will be a data generator function that takes in batch_size, x, y, and shuffle, where x could be a large list of samples and y is a list of the tags associated with those samples. It returns a subset of those inputs as a tuple of two arrays (X, Y), each of dimension (batch_size). If shuffle=True, the data is traversed in random order.
The function has an outer while True loop, which runs continuously in the fashion of generators, pausing when yielding the next values. Each pass of this loop generates one output of batch_size samples.
It has an inner loop that stores the data samples that will be included in the next batch in temporary lists (X, Y).
There are three slightly out-of-the-ordinary features to this function.
- The first is the use of a list of a predefined size to store the data for each batch. Using a predefined-size list can reduce the computation time if the elements are of a fixed size, like numbers. If the elements are of different sizes, it is better to start with an empty list and append one element at a time during the loop.
- The second is tracking the current location in the incoming lists of samples. Generator variables hold their values between invocations, so we create an index variable (source_location in the code below), initialize it to zero, and increment it by one for each sample included in a batch. However, we do not use this index to access the positions of the list of samples directly. Instead, we use it to select one index from a list of indexes (source_indices). In this way, we can change the order in which we traverse the original list while leaving the list itself untouched.
- The third also relates to wrapping. Because batch_size and the length of the input lists are not aligned, gathering a batch_size group of inputs may involve wrapping back to the beginning of the input list. In our approach, it is enough to reset the index to 0. We can also re-shuffle the list of indexes at that point to produce different batches each time.
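The first point can be sketched as follows. Both styles build the same batch; which one is faster depends on the data and the interpreter, so no timing is claimed here. The names samples, preallocated, and appended are illustrative only.

```python
samples = [10, 20, 30, 40]
batch_size = 3

# Option 1: preallocate a list of the final size and fill each slot
preallocated = [None] * batch_size
for slot in range(batch_size):
    preallocated[slot] = samples[slot]

# Option 2: start empty and append one element at a time
appended = []
for slot in range(batch_size):
    appended.append(samples[slot])

print(preallocated == appended)
```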
def data_generator(batch_size: int, data_x: list, data_y: list, shuffle: bool=True):
    """Infinite batch generator

    Args:
     batch_size: the size to make batches
     data_x: list containing samples
     data_y: list containing labels
     shuffle: Shuffle the data order
    Yields:
     a tuple containing 2 elements:
      X - list of dim (batch_size) of samples
      Y - list of dim (batch_size) of labels
    """
    amount_of_data = len(data_x)
    assert amount_of_data == len(data_y)

    def re_shuffle(x):
        k = len(x)
        return random.sample(range(k), k=k)

    shuffler = re_shuffle if shuffle else lambda x: list(range(len(x)))
    source_indices = shuffler(data_x)
    source_location = 0
    while True:
        # predefined-size lists that hold the current batch
        X = list(range(batch_size))
        Y = list(range(batch_size))
        for batch_location in range(batch_size):
            X[batch_location] = data_x[source_indices[source_location]]
            Y[batch_location] = data_y[source_indices[source_location]]
            # wrap back to the start once the data is exhausted
            source_location = (source_location + 1) % amount_of_data
            # get a new traversal order at the start of each pass over the data
            source_indices = (shuffler(data_x) if source_location == 0
                              else source_indices)
        yield((X, Y))
    return
def test_data_generator() -> None:
    """Tests the un-shuffled version of the generator

    Raises:
     AssertionError: some value didn't match.
    """
    x = [1, 2, 3, 4]
    y = [xi ** 2 for xi in x]
    generator = data_generator(3, x, y, shuffle=False)
    for expected in (([1, 2, 3], [1, 4, 9]),
                     ([4, 1, 2], [16, 1, 4]),
                     ([3, 4, 1], [9, 16, 1]),
                     ([2, 3, 4], [4, 9, 16])):
        expect(numpy.allclose(next(generator), expected)).to(be_true)
    return

test_data_generator()
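The test above only covers shuffle=False. With shuffling the exact batches cannot be predicted, but some properties can still be checked: each pass over the data should visit every sample once, and each label should stay paired with its sample. The sketch below repeats a compact restatement of the generator so that it runs on its own; it appends to empty lists rather than preallocating, and the helper new_order is a hypothetical name.

```python
import random
from itertools import islice


def data_generator(batch_size, data_x, data_y, shuffle=True):
    """Compact restatement of the generator above (appends instead of preallocating)."""
    amount_of_data = len(data_x)
    assert amount_of_data == len(data_y)

    def new_order():
        # hypothetical helper: the index order for one pass over the data
        indices = list(range(amount_of_data))
        if shuffle:
            random.shuffle(indices)
        return indices

    source_indices = new_order()
    source_location = 0
    while True:
        X, Y = [], []
        for _ in range(batch_size):
            X.append(data_x[source_indices[source_location]])
            Y.append(data_y[source_indices[source_location]])
            source_location = (source_location + 1) % amount_of_data
            if source_location == 0:
                source_indices = new_order()  # start a new pass
        yield X, Y


x = [1, 2, 3, 4]
y = [xi ** 2 for xi in x]

# Four batches of three cover exactly three passes over the four samples
batches = list(islice(data_generator(3, x, y, shuffle=True), 4))
seen_x = [value for X, _ in batches for value in X]
seen_y = [value for _, Y in batches for value in Y]

for X, Y in batches:
    print(X, Y)
```

Because the shuffle happens on the index list, x and y themselves are never reordered, which is why the pairing between samples and labels survives.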