IMDB Reviews Tensorflow Dataset

Beginning

We're going to use the IMDB Reviews Dataset (used in this tutorial) - a set of 50,000 movie reviews taken from the Internet Movie Database, each classified as either positive or negative. The original source appears to be a page on Stanford University's web site titled Large Movie Review Dataset. The dataset is widely available (from the Stanford page and Kaggle, for instance), but this will also serve as practice for using TensorFlow Datasets.

Imports

Python

from functools import partial

PyPi

import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets

Graeae

from graeae import EmbedHoloviews, Timer

Set Up

Plotting

SLUG = "imdb-reviews-tensorflow-dataset"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")

Timer

TIMER = Timer()

Middle

Get the Dataset

Load It

The load function takes quite a few parameters; in this case we're passing in just three - the name of the dataset, with_info, which tells it to return both a Dataset and a DatasetInfo object, and as_supervised, which tells the builder to return the Dataset as a series of (input, label) tuples.

dataset, info = tensorflow_datasets.load('imdb_reviews/subwords8k',
                                         with_info=True,
                                         as_supervised=True)
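Since we asked for the DatasetInfo object, we can use it for a quick sanity check on the sizes of the splits before doing anything else. This is just a sketch using the metadata in info.splits:

# Sketch: DatasetInfo keeps per-split metadata, including example counts.
for name, split in info.splits.items():
    print(f"{name}: {split.num_examples:,} examples")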

Split It

The dataset is a dict with three keys:

print(dataset.keys())
dict_keys(['test', 'train', 'unsupervised'])

As you might guess, we don't use the unsupervised key.

train_dataset, test_dataset = dataset['train'], dataset['test']
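Because we passed as_supervised=True, each element of these datasets should be a (text, label) tuple, where the text is already encoded as a vector of subword IDs. A quick peek at one example (a sketch, assuming eager execution so we can iterate over the dataset directly):

# Sketch: grab one example to confirm the (encoded text, label) structure.
for text, label in train_dataset.take(1):
    print(text.shape, text.dtype)  # a variable-length vector of subword IDs
    print(label.numpy())           # 0 (negative) or 1 (positive)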

The Tokenizer

One of the advantages of using the TensorFlow Datasets version of this dataset is that it comes with a pre-built tokenizer inside the DatasetInfo object.

print(info.features)
FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>),
})
tokenizer = info.features['text'].encoder
print(tokenizer)
<SubwordTextEncoder vocab_size=8185>

The tokenizer is a SubwordTextEncoder with a vocabulary size of 8,185.
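The SubwordTextEncoder breaks text into subword units, so it can encode arbitrary strings and decode them back. A small round-trip sketch with a made-up sentence:

# Sketch: encode a string to subword IDs and decode it back.
sample = "This movie was a delightful surprise."
encoded = tokenizer.encode(sample)
decoded = tokenizer.decode(encoded)
print(encoded)
print(decoded == sample)  # the round trip should reproduce the original text

# Each ID maps to a subword fragment, not necessarily a whole word.
for token in encoded:
    print(f"{token} -> {tokenizer.decode([token])!r}")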

Set Up Data

We're going to shuffle the training data and then use padded batches for both sets so that all the reviews within a batch are the same length.

BUFFER_SIZE = 20000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
# the second argument gives the padded_shapes, taken here from the dataset itself
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
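padded_batch pads every review in a batch to the length of the longest review in that batch, so different batches can have different widths. A quick check of one batch's shape (a sketch; note that newer TensorFlow versions can infer the padded shapes, so padded_batch(BATCH_SIZE) alone may be enough there):

# Sketch: each padded batch is (batch_size, longest_review_in_batch).
for texts, labels in train_dataset.take(1):
    print(texts.shape)   # e.g. (64, <length of the longest review in this batch>)
    print(labels.shape)  # (64,)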

The Model

model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tensorflow.keras.layers.Bidirectional(tensorflow.keras.layers.LSTM(64)),
    tensorflow.keras.layers.Dense(64, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 64)          523840    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 598,209
Trainable params: 598,209
Non-trainable params: 0
_________________________________________________________________
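The parameter counts in the summary can be checked by hand, which is a nice way to confirm that the Bidirectional wrapper doubles the LSTM weights and that the first Dense layer sees the concatenated 128-dimensional output. A back-of-the-envelope sketch:

# Sketch: reproduce the parameter counts from model.summary() by hand.
vocab_size = tokenizer.vocab_size  # 8,185
embedding_dim = 64
lstm_units = 64

embedding_params = vocab_size * embedding_dim  # 523,840
# One LSTM direction has 4 gates, each with input, recurrent, and bias weights;
# Bidirectional doubles it.
lstm_params = 2 * 4 * ((embedding_dim + lstm_units + 1) * lstm_units)  # 66,048
dense_params = (2 * lstm_units + 1) * 64  # 8,256
output_params = 64 + 1                    # 65

print(embedding_params + lstm_params + dense_params + output_params)  # 598,209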

Compile It

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Train It

EPOCHS = 10
SILENT = 0
ONCE_PER_EPOCH = 2
with TIMER:
    history = model.fit(train_dataset,
                        epochs=EPOCHS,
                        validation_data=test_dataset,
                        verbose=ONCE_PER_EPOCH)
2019-09-21 15:52:50,469 graeae.timers.timer start: Started: 2019-09-21 15:52:50.469787
I0921 15:52:50.469841 140086305412928 timer.py:70] Started: 2019-09-21 15:52:50.469787
Epoch 1/10
391/391 - 80s - loss: 0.3991 - accuracy: 0.8377 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
391/391 - 80s - loss: 0.3689 - accuracy: 0.8571 - val_loss: 0.4595 - val_accuracy: 0.8021
Epoch 3/10
391/391 - 80s - loss: 0.3664 - accuracy: 0.8444 - val_loss: 0.5262 - val_accuracy: 0.7228
Epoch 4/10
391/391 - 80s - loss: 0.5611 - accuracy: 0.7133 - val_loss: 0.6832 - val_accuracy: 0.6762
Epoch 5/10
391/391 - 80s - loss: 0.6151 - accuracy: 0.6597 - val_loss: 0.5164 - val_accuracy: 0.7844
Epoch 6/10
391/391 - 80s - loss: 0.3842 - accuracy: 0.8340 - val_loss: 0.4970 - val_accuracy: 0.7996
Epoch 7/10
391/391 - 80s - loss: 0.2449 - accuracy: 0.9058 - val_loss: 0.3639 - val_accuracy: 0.8463
Epoch 8/10
391/391 - 80s - loss: 0.1896 - accuracy: 0.9306 - val_loss: 0.3698 - val_accuracy: 0.8614
Epoch 9/10
391/391 - 80s - loss: 0.1555 - accuracy: 0.9456 - val_loss: 0.3896 - val_accuracy: 0.8535
Epoch 10/10
391/391 - 80s - loss: 0.1195 - accuracy: 0.9606 - val_loss: 0.4878 - val_accuracy: 0.8428
2019-09-21 16:06:09,935 graeae.timers.timer end: Ended: 2019-09-21 16:06:09.935707
I0921 16:06:09.935745 140086305412928 timer.py:77] Ended: 2019-09-21 16:06:09.935707
2019-09-21 16:06:09,938 graeae.timers.timer end: Elapsed: 0:13:19.465920
I0921 16:06:09.938812 140086305412928 timer.py:78] Elapsed: 0:13:19.465920

Plot the Performance

  • Note: This only works if your kernel is on the local machine; running it remotely gives an error, because it tries to save the plot on the remote machine.
data = pandas.DataFrame(history.history)
data = data.rename(columns={"loss": "Training Loss",
                            "accuracy": "Training Accuracy",
                            "val_loss": "Validation Loss",
                            "val_accuracy": "Validation Accuracy"})
plot = data.hvplot().opts(title="LSTM IMDB Performance", width=1000, height=800)
Embed(plot=plot, file_name="model_performance")()

Figure Missing

It looks like I over-trained it: the training metrics keep improving while the validation loss climbs back up in the later epochs. (Also note that I used this notebook for troubleshooting, so there was actually one extra epoch that isn't shown.)
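As an aside, if you don't have hvplot and graeae installed, something like this matplotlib sketch should produce a roughly equivalent plot (it saves a PNG locally instead of embedding HTML):

# Sketch: a plain matplotlib version of the performance plot.
import matplotlib.pyplot as plt

axis = data.plot(title="LSTM IMDB Performance", figsize=(10, 8))
axis.set_xlabel("Epoch")
axis.set_ylabel("Loss / Accuracy")
plt.savefig("model_performance.png")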

End

Citation

This is the paper that introduced the dataset.

  • Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).