IMDB Reviews TensorFlow Dataset
Beginning
We're going to use the IMDB Reviews Dataset (used in this tutorial) - a set of 50,000 movie reviews taken from the Internet Movie Database that have been classified as either positive or negative. The original source appears to be a page on Stanford University's web site titled Large Movie Review Dataset. The dataset is widely available (from the Stanford page and Kaggle, for instance), but this will also serve as practice for using TensorFlow Datasets.
Imports
Python
from functools import partial
PyPi
import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets
Graeae
from graeae import EmbedHoloviews, Timer
Set Up
Plotting
SLUG = "imdb-reviews-tensorflow-dataset"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")
Timer
TIMER = Timer()
Middle
Get the Dataset
Load It
The load function takes quite a few parameters; in this case we're just passing in three - the name of the dataset, with_info, which tells it to return both a Dataset and a DatasetInfo object, and as_supervised, which tells the builder to return the Dataset as a series of (input, label) tuples.
dataset, info = tensorflow_datasets.load('imdb_reviews/subwords8k',
with_info=True,
as_supervised=True)
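As a quick check (nothing from the original run, just the DatasetInfo splits API), you can ask the info object how big each split is - it should report 25,000 examples each for train and test and 50,000 for unsupervised.

for name, split in info.splits.items():
    print(f"{name}: {split.num_examples:,}")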
Split It
The dataset is a dict with three keys:
print(dataset.keys())
dict_keys(['test', 'train', 'unsupervised'])
As you might guess, we don't use the unsupervised key.
train_dataset, test_dataset = dataset['train'], dataset['test']
The Tokenizer
One of the advantages of using the TensorFlow Datasets version of this is that it comes with a pre-built tokenizer inside the DatasetInfo object.
print(info.features)
FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>),
})
tokenizer = info.features['text'].encoder
print(tokenizer)
<SubwordTextEncoder vocab_size=8185>
The tokenizer is a SubwordTextEncoder with a vocabulary size of 8,185.
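As a sanity check you can round-trip a sample sentence through it (a minimal sketch - encode and decode are part of the SubwordTextEncoder API, but the sentence is just made up here).

sample = "TensorFlow is fun."
encoded = tokenizer.encode(sample)
decoded = tokenizer.decode(encoded)
# the encoder falls back to byte-level tokens for anything outside
# its vocabulary, so the round-trip is lossless
assert decoded == sample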
Set Up Data
We're going to shuffle the training data and then batch both sets with padding so that the sequences in each batch are all the same length.
BUFFER_SIZE = 20000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
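A side note for anyone on a newer install: the output_shapes attribute was removed from datasets in TensorFlow 2, and as of TensorFlow 2.2 padded_batch can infer the padded shapes itself, so the equivalent there (untested here) would just be:

train_dataset = train_dataset.padded_batch(BATCH_SIZE)
test_dataset = test_dataset.padded_batch(BATCH_SIZE)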
The Model
model = tensorflow.keras.Sequential([
tensorflow.keras.layers.Embedding(tokenizer.vocab_size, 64),
tensorflow.keras.layers.Bidirectional(tensorflow.keras.layers.LSTM(64)),
tensorflow.keras.layers.Dense(64, activation='relu'),
tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, None, 64) 523840 _________________________________________________________________ bidirectional (Bidirectional (None, 128) 66048 _________________________________________________________________ dense (Dense) (None, 64) 8256 _________________________________________________________________ dense_1 (Dense) (None, 1) 65 ================================================================= Total params: 598,209 Trainable params: 598,209 Non-trainable params: 0 _________________________________________________________________
Compile It
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
Train It
EPOCHS = 10
SILENT = 0
ONCE_PER_EPOCH = 2
with TIMER:
history = model.fit(train_dataset,
epochs=EPOCHS,
validation_data=test_dataset,
verbose=ONCE_PER_EPOCH)
2019-09-21 15:52:50,469 graeae.timers.timer start: Started: 2019-09-21 15:52:50.469787
I0921 15:52:50.469841 140086305412928 timer.py:70] Started: 2019-09-21 15:52:50.469787
Epoch 1/10
391/391 - 80s - loss: 0.3991 - accuracy: 0.8377 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
391/391 - 80s - loss: 0.3689 - accuracy: 0.8571 - val_loss: 0.4595 - val_accuracy: 0.8021
Epoch 3/10
391/391 - 80s - loss: 0.3664 - accuracy: 0.8444 - val_loss: 0.5262 - val_accuracy: 0.7228
Epoch 4/10
391/391 - 80s - loss: 0.5611 - accuracy: 0.7133 - val_loss: 0.6832 - val_accuracy: 0.6762
Epoch 5/10
391/391 - 80s - loss: 0.6151 - accuracy: 0.6597 - val_loss: 0.5164 - val_accuracy: 0.7844
Epoch 6/10
391/391 - 80s - loss: 0.3842 - accuracy: 0.8340 - val_loss: 0.4970 - val_accuracy: 0.7996
Epoch 7/10
391/391 - 80s - loss: 0.2449 - accuracy: 0.9058 - val_loss: 0.3639 - val_accuracy: 0.8463
Epoch 8/10
391/391 - 80s - loss: 0.1896 - accuracy: 0.9306 - val_loss: 0.3698 - val_accuracy: 0.8614
Epoch 9/10
391/391 - 80s - loss: 0.1555 - accuracy: 0.9456 - val_loss: 0.3896 - val_accuracy: 0.8535
Epoch 10/10
391/391 - 80s - loss: 0.1195 - accuracy: 0.9606 - val_loss: 0.4878 - val_accuracy: 0.8428
2019-09-21 16:06:09,935 graeae.timers.timer end: Ended: 2019-09-21 16:06:09.935707
I0921 16:06:09.935745 140086305412928 timer.py:77] Ended: 2019-09-21 16:06:09.935707
2019-09-21 16:06:09,938 graeae.timers.timer end: Elapsed: 0:13:19.465920
I0921 16:06:09.938812 140086305412928 timer.py:78] Elapsed: 0:13:19.465920
Plot the Performance
- Note: This only works if your kernel is on the local machine; running it remotely raises an error, because it tries to save the plot on the remote machine.
data = pandas.DataFrame(history.history)
data = data.rename(columns={"loss": "Training Loss",
"accuracy": "Training Accuracy",
"val_loss": "Validation Loss",
"val_accuracy": "Validation Accuracy"})
plot = data.hvplot().opts(title="LSTM IMDB Performance", width=1000, height=800)
Embed(plot=plot, file_name="model_performance")()
It looks like I over-trained it - the validation loss bottoms out around epoch seven and climbs after that while the training loss keeps falling. (Also note that I used this notebook to troubleshoot, so there was actually one extra epoch that isn't shown.)
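One way to avoid that in a future run (a sketch using Keras' EarlyStopping callback, not something that was actually run here) would be to stop training once the validation loss stops improving:

early_stopping = tensorflow.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=2,
    restore_best_weights=True)

with TIMER:
    history = model.fit(train_dataset,
                        epochs=EPOCHS,
                        validation_data=test_dataset,
                        callbacks=[early_stopping],
                        verbose=ONCE_PER_EPOCH)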
End
Citation
This is the paper where the dataset was originally used.
- Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).