Multi-Layer LSTM

Beginning

Imports

Python

from functools import partial
from pathlib import Path
import pickle

PyPi

import holoviews
import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets

Others

from graeae import Timer, EmbedHoloviews

Set Up

The Timer

TIMER = Timer()

Plotting

Embed = partial(EmbedHoloviews,
                folder_path="../../files/posts/keras/multi-layer-lstm/")

The Dataset

This once again uses the IMDB dataset of 50,000 movie reviews. The text has already been converted from strings to integers - each review is a sequence of integer IDs produced by a subword tokenizer. Passing in imdb_reviews/subwords8k selects the version whose vocabulary is capped at roughly 8,000 subwords, and adding with_info=True returns an object that holds, among other things, the encoder that maps between text and those integer IDs.

Note: The first time you run this it will download a fairly large dataset, so it might appear to hang, but subsequent runs are fairly quick.

dataset, info = tensorflow_datasets.load("imdb_reviews/subwords8k",
                                         with_info=True,
                                         as_supervised=True)

Middle

Set Up the Datasets

train_dataset, test_dataset = dataset["train"], dataset["test"]
tokenizer = info.features['text'].encoder
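
To get a feel for the encoder you can round-trip a string through it. This is just a sketch - the sample sentence is made up and the exact integer IDs depend on the subword vocabulary.

sample = "This movie was surprisingly good."
encoded = tokenizer.encode(sample)
print(encoded)                     # the review as a list of subword IDs
print(tokenizer.decode(encoded))   # back to the original string
print(f"Vocabulary size: {tokenizer.vocab_size:,}")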

Now we're going to shuffle and pad the data. The BUFFER_SIZE argument sets the size of the buffer to sample from: 10,000 entries from the training set are loaded into the buffer, and the "shuffle" works by repeatedly picking a random item from the buffer and replacing it with the next item from the dataset until everything has passed through. The padded_batch method then groups consecutive entries into batches and pads the entries in each batch so they all have the same shape.

The BATCH_SIZE needs a little tuning. If it's too big, the batches might not fit in GPU memory (and the model might not generalize as well); if it's too small, training will take a long time. If you train the model and the GPU utilization stays at 0, try reducing the batch size.

Also note that if you change the batch size you have to go back to the previous step and re-define train_dataset and test_dataset, because the next step replaces them with batched versions, and batching an already-batched dataset gives the wrong shapes.

BUFFER_SIZE = 10000
# if the batch size is too big it will run out of memory on the GPU
# so you might have to experiment with this
BATCH_SIZE = 32

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
# pad each batch to the length of its longest entry
# (newer versions of TensorFlow can infer the padded shapes,
#  so the second argument may not be needed there)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
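
If you want to see what padded_batch did, you can pull a single batch and look at its shape - every review in the batch gets padded to the length of the longest one, so the second dimension changes from batch to batch. A minimal sketch (assuming TensorFlow 2's eager execution):

for reviews, labels in train_dataset.take(1):
    print(reviews.shape)  # (BATCH_SIZE, length of the longest review in this batch)
    print(labels.shape)   # (BATCH_SIZE,)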

The Model

The previous model had one Bidirectional layer; this one adds a second.

Embedding

The Embedding layer takes our integer inputs and converts them to dense vectors of real numbers, which are a better input for a neural network.
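
As a rough illustration (the token IDs here are made up, not real subwords from the vocabulary):

embedding = tensorflow.keras.layers.Embedding(tokenizer.vocab_size, 64)
token_ids = tensorflow.constant([[12, 305, 7, 4000]])  # a made-up "review" of four token IDs
print(embedding(token_ids).shape)  # (1, 4, 64) - each integer becomes a 64-dimensional vector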

Bidirectional

The Bidirectional layer is a wrapper for recurrent layers: it runs the wrapped layer over the input both forwards and backwards and concatenates the two outputs, so the output size is double that of the wrapped layer.
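
A minimal sketch of the doubling, using a random tensor as a stand-in for real input:

bidirectional = tensorflow.keras.layers.Bidirectional(
    tensorflow.keras.layers.LSTM(64))
dummy = tensorflow.random.uniform((1, 10, 64))  # (batch, time-steps, features)
print(bidirectional(dummy).shape)  # (1, 128) - 64 forwards concatenated with 64 backwards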

LSTM

The LSTM layer implements Long Short-Term Memory. The first argument is the number of units, which sets the size of its output. This model is similar to the one we ran previously on the same data, but it has an extra Bidirectional LSTM layer (so it uses more memory). Note that the first LSTM is given return_sequences=True so that it outputs its whole sequence of per-time-step outputs, which is what the second LSTM needs as input; the second LSTM returns only its final output.
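
A minimal sketch of what return_sequences changes, again with a random stand-in for the input:

dummy = tensorflow.random.uniform((1, 10, 64))  # (batch, time-steps, features)
print(tensorflow.keras.layers.LSTM(64, return_sequences=True)(dummy).shape)  # (1, 10, 64)
print(tensorflow.keras.layers.LSTM(64)(dummy).shape)                         # (1, 64)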

model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tensorflow.keras.layers.Bidirectional(
        tensorflow.keras.layers.LSTM(64, return_sequences=True)),
    tensorflow.keras.layers.Bidirectional(
        tensorflow.keras.layers.LSTM(32)),
    tensorflow.keras.layers.Dense(64, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 64)          523840    
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         66048     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                41216     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 635,329
Trainable params: 635,329
Non-trainable params: 0
_________________________________________________________________
None

Compile It

model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])

Train the Model

ONCE_PER_EPOCH = 2  # verbose=2 tells keras to print one line per epoch
NUM_EPOCHS = 10
with TIMER:
    history = model.fit(train_dataset,
                        epochs=NUM_EPOCHS,
                        validation_data=test_dataset,
                        verbose=ONCE_PER_EPOCH)
2019-09-21 17:26:50,395 graeae.timers.timer start: Started: 2019-09-21 17:26:50.394797
I0921 17:26:50.395130 140275698915136 timer.py:70] Started: 2019-09-21 17:26:50.394797
Epoch 1/10
W0921 17:26:51.400280 140275698915136 deprecation.py:323] From /home/hades/.virtualenvs/In-Too-Deep/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
782/782 - 224s - loss: 0.6486 - accuracy: 0.6039 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
782/782 - 214s - loss: 0.4941 - accuracy: 0.7661 - val_loss: 0.6706 - val_accuracy: 0.6744
Epoch 3/10
782/782 - 216s - loss: 0.4087 - accuracy: 0.8266 - val_loss: 0.4024 - val_accuracy: 0.8222
Epoch 4/10
782/782 - 217s - loss: 0.2855 - accuracy: 0.8865 - val_loss: 0.3343 - val_accuracy: 0.8645
Epoch 5/10
782/782 - 216s - loss: 0.2097 - accuracy: 0.9217 - val_loss: 0.2936 - val_accuracy: 0.8837
Epoch 6/10
782/782 - 217s - loss: 0.1526 - accuracy: 0.9467 - val_loss: 0.3188 - val_accuracy: 0.8771
Epoch 7/10
782/782 - 215s - loss: 0.1048 - accuracy: 0.9657 - val_loss: 0.3750 - val_accuracy: 0.8710
Epoch 8/10
782/782 - 216s - loss: 0.0764 - accuracy: 0.9757 - val_loss: 0.3821 - val_accuracy: 0.8762
Epoch 9/10
782/782 - 216s - loss: 0.0585 - accuracy: 0.9832 - val_loss: 0.4747 - val_accuracy: 0.8683
Epoch 10/10
782/782 - 216s - loss: 0.0438 - accuracy: 0.9883 - val_loss: 0.4441 - val_accuracy: 0.8704
2019-09-21 18:02:56,353 graeae.timers.timer end: Ended: 2019-09-21 18:02:56.353722
I0921 18:02:56.353781 140275698915136 timer.py:77] Ended: 2019-09-21 18:02:56.353722
2019-09-21 18:02:56,356 graeae.timers.timer end: Elapsed: 0:36:05.958925
I0921 18:02:56.356238 140275698915136 timer.py:78] Elapsed: 0:36:05.958925

Looking at the Performance

To get the history I had to pickle it on the training machine and then copy it over to the machine with this org-notebook, so you can't just run this notebook end-to-end unless everything is run on the same machine (which, in this case, it wasn't).

# on the training machine: save the history
path = Path("~/history.pkl").expanduser()
with path.open("wb") as writer:
    pickle.dump(history.history, writer)

# on the notebook machine: load it back and plot it
path = Path("~/history.pkl").expanduser()
with path.open("rb") as reader:
    history = pickle.load(reader)
data = pandas.DataFrame(history)
best = data.val_loss.idxmin()
best_line = holoviews.VLine(best)
plot = (data.hvplot() * best_line).opts(
    title="Two-Layer LSTM Model",
    width=1000,
    height=800)
Embed(plot=plot, file_name="lstm_training")()

[Plot: "Two-Layer LSTM Model" - training and validation loss and accuracy per epoch, with a vertical line at the epoch with the lowest validation loss]

It looks like the best epoch was the fifth one, with a validation loss of 0.29 and a validation accuracy of 0.88; after that the model starts to overfit. It seems that text might be a harder problem than images.