NLP Classification Exercise

Beginning

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

PyPi

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import hvplot.pandas
import numpy
import pandas

Others

from graeae import (CountPercentage,
                    EmbedHoloviews,
                    SubPathLoader,
                    ZipDownloader)

Set Up

The Plotting

slug = "nlp-classification-exercise"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{slug}")

The Dataset

It isn't mentioned in the notebook where the data originally came from, but it looks like it's the Sentiment140 dataset, which consists of tweets whose sentiment was inferred from the emoticons in each tweet.

url = "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"
path = Path("~/data/datasets/texts/sentiment140/").expanduser()
download = ZipDownloader(url, path)
download()
2019-09-29 12:26:31,866 ZipDownloader start: (ZipDownloader) Started: 2019-09-29 12:26:31.866085
2019-09-29 12:26:31,867 ZipDownloader download: Downloading the zip file
2019-09-29 12:26:57,309 ZipDownloader end: (ZipDownloader) Ended: 2019-09-29 12:26:57.309619
2019-09-29 12:26:57,311 ZipDownloader end: (ZipDownloader) Elapsed: 0:00:25.443534

columns = ["polarity", "tweet_id", "datetime", "query", "user", "text"]
training = pandas.read_csv(path/"training.1600000.processed.noemoticon.csv", 
                           encoding="latin-1", names=columns, header=None)
testing = pandas.read_csv(path/"testdata.manual.2009.06.14.csv", 
                           encoding="latin-1", names=columns, header=None)

Some Constants

Text = Namespace(
    embedding_dim = 100,
    max_length = 16,
    trunc_type='post',
    padding_type='post',
    oov_tok = "<OOV>",
    training_size=16000,
)

Middle

The Data

print(training.sample().iloc[0])
polarity                                                    4
tweet_id                                           2001983272
datetime                         Tue Jun 02 02:44:58 PDT 2009
query                                                NO_QUERY
user                                                    pod13
text        @wkdjellybaby to private email??? I will check...
Name: 1283818, dtype: object

CountPercentage(training.polarity)()
Value Count Percent (%)
4 800,000 50.00
0 800,000 50.00

The polarity is what might also be called the "sentiment" of the tweet - 0 means a negative tweet and 4 means a positive tweet.

But, for our purposes, we would be better off if the positive polarity was 1, not 4, so let's convert it.

training.loc[training.polarity==4, "polarity"] = 1
counts = CountPercentage(training.polarity)()
Value Count Percent (%)
1 800,000 50.00
0 800,000 50.00

The Tokenizer

As you can see from the sample, the data is still in text form so we need to convert it to a numeric form with a Tokenizer.

tokenizer = Tokenizer()
tokenizer.fit_on_texts(training.text.values)
word_index = tokenizer.word_index
vocabulary_size = len(tokenizer.word_index)
sequences = tokenizer.texts_to_sequences(training.text.values)
padded = pad_sequences(sequences, maxlen=Text.max_length,
                       truncating=Text.trunc_type)
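
To get a feel for what the tokenizer just did, here's a quick look at a single tweet before and after encoding (a sketch that only re-uses the tokenizer and Text settings defined above).

sample_text = training.text.iloc[0]
sample_sequence = tokenizer.texts_to_sequences([sample_text])[0]
sample_padded = pad_sequences([sample_sequence], maxlen=Text.max_length,
                              truncating=Text.trunc_type)
print(sample_text)
print(sample_sequence)
print(sample_padded)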

splits = train_test_split(
    padded, training.polarity, test_size=.2)

training_sequences, test_sequences, training_labels, test_labels = splits

GloVe

GloVe is short for Global Vectors for Word Representation. It is an unsupervised algorithm that creates vector representations for words. They have a site where you can download pre-trained models or get the code and train one yourself. We're going to use one of their pre-trained models.

path = Path("~/models/glove/").expanduser()
url = "http://nlp.stanford.edu/data/glove.6B.zip"
ZipDownloader(url, path)()
Files exist, not downloading

The GloVe data is stored as a series of space-separated lines, with the first column being the word that's encoded and the rest of the columns being the values of its vector. To load it we're going to split the word off from the rest of the line and use it as the key for its vector in a dictionary.

embeddings = {}
with open(path/"glove.6B.100d.txt") as lines:
    for line in lines:
        tokens = line.split()
        embeddings[tokens[0]] = numpy.array(tokens[1:])
print(f"{len(embeddings):,}")
400,000

So the GloVe vocabulary consists of 400,000 "words" (tokens is more accurate, since they also include punctuation). The problem we have to deal with next is that our tweets weren't part of the corpus used to train the embeddings, so there will probably be some tokens in our data set that aren't in the embeddings. To handle this we need to add zeroed embeddings for the extra tokens.
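
Before building that matrix, it might be worth a quick check of how many of the tweet tokens are actually missing from GloVe (a small sketch that only uses the word_index and embeddings dictionaries from above).

missing = [word for word in word_index if word not in embeddings]
print(f"Tokens without a GloVe vector: {len(missing):,} "
      f"({len(missing) / len(word_index):.2%} of the tweet vocabulary)")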

Rather than adding to the dict, we'll create a matrix of zeros with a row for each word in our dataset's vocabulary, then we'll iterate over the words in our dataset and, if there's a match in the GloVe embeddings, insert its vector into the matrix.

embeddings_matrix = numpy.zeros((vocabulary_size + 1, Text.embedding_dim))
for word, index in word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embeddings_matrix[index] = embedding_vector
print(f"{len(embeddings_matrix):,}")
690,961
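
The notebook ends here without actually using the matrix, but to show why it has vocabulary_size + 1 rows, here's a sketch of how it would typically be handed to a Keras Embedding layer as frozen pre-trained weights - this is just an illustration, not the model from the original notebook.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding

# the matrix becomes the initial weights of the embedding layer, so
# input_dim has to match its number of rows (vocabulary_size + 1)
sketch = Sequential([
    Embedding(input_dim=vocabulary_size + 1,
              output_dim=Text.embedding_dim,
              input_length=Text.max_length,
              weights=[embeddings_matrix],
              trainable=False),
])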

End

Citations

  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

Embeddings from Scratch

Beginning

This is a walk-through of the tensorflow Word Embeddings tutorial, just to make sure I can do it.

Imports

Python

from argparse import Namespace
from functools import partial

PyPi

from tensorflow import keras
from tensorflow.keras import layers
import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets

Others

from graeae import EmbedHoloviews, Timer

Set Up

Plotting

prefix = "../../files/posts/keras/"
slug = "embeddings-from-scratch"

Embed = partial(EmbedHoloviews, folder_path=f"{prefix}{slug}")

The Timer

TIMER = Timer()

Middle

Some Constants

Text = Namespace(
    vocabulary_size=1000,
    embeddings_size=16,
    max_length=500,
    padding="post",
)

Tokens = Namespace(
    padding = "<PAD>",
    start = "<START>",
    unknown = "<UNKNOWN>",
    unused = "<UNUSED>",
)

The Embeddings Layer

print(layers.Embedding.__doc__)
Turns positive integers (indexes) into dense vectors of fixed size.

  e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]`

  This layer can only be used as the first layer in a model.

  Example:

  ```python
  model = Sequential()
  model.add(Embedding(1000, 64, input_length=10))
  # the model will take as input an integer matrix of size (batch,
  # input_length).
  # the largest integer (i.e. word index) in the input should be no larger
  # than 999 (vocabulary size).
  # now model.output_shape == (None, 10, 64), where None is the batch
  # dimension.

  input_array = np.random.randint(1000, size=(32, 10))

  model.compile('rmsprop', 'mse')
  output_array = model.predict(input_array)
  assert output_array.shape == (32, 10, 64)
  ```

  Arguments:
    input_dim: int > 0. Size of the vocabulary,
      i.e. maximum integer index + 1.
    output_dim: int >= 0. Dimension of the dense embedding.
    embeddings_initializer: Initializer for the `embeddings` matrix.
    embeddings_regularizer: Regularizer function applied to
      the `embeddings` matrix.
    embeddings_constraint: Constraint function applied to
      the `embeddings` matrix.
    mask_zero: Whether or not the input value 0 is a special "padding"
      value that should be masked out.
      This is useful when using recurrent layers
      which may take variable length input.
      If this is `True` then all subsequent layers
      in the model need to support masking or an exception will be raised.
      If mask_zero is set to True, as a consequence, index 0 cannot be
      used in the vocabulary (input_dim should equal size of
      vocabulary + 1).
    input_length: Length of input sequences, when it is constant.
      This argument is required if you are going to connect
      `Flatten` then `Dense` layers upstream
      (without it, the shape of the dense outputs cannot be computed).

  Input shape:
    2D tensor with shape: `(batch_size, input_length)`.

  Output shape:
    3D tensor with shape: `(batch_size, input_length, output_dim)`.
  
embedding_layer = layers.Embedding(Text.vocabulary_size, Text.embeddings_size)

The first argument is the number of possible words in the vocabulary and the second is the number of dimensions. The Embedding is a sort of lookup table that maps an integer that represents a word to a vector. In this case we're going to build a vocabulary of 1,000 words represented by vectors with a length of 16. The weights in the vectors are learned when we train the model and will encode the distance between words.

The input to the embeddings layer is a 2D tensor of integers with the shape (number of samples, sequence_length). The sequences are integer-encoded sentences of the same length - so you have to pad the shorter sentences to match the longest one (the sequence_length).

The output of the embeddings layer is a 3D tensor with the shape (number of samples, sequence_length, embedding_dimensionality).
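
As a sanity check on those shapes, we can call the layer directly on a tiny batch of made-up integer-encoded sentences (a sketch that re-uses the embedding_layer defined above; the numbers are arbitrary).

fake_batch = tensorflow.constant([[1, 2, 3, 0],
                                  [4, 5, 0, 0]])
vectors = embedding_layer(fake_batch)
# (2 samples, sequence_length of 4, embedding dimension of 16)
print(vectors.shape)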

The Dataset

(train_data, test_data), info = tensorflow_datasets.load(
    "imdb_reviews/subwords8k",
    split=(tensorflow_datasets.Split.TRAIN,
           tensorflow_datasets.Split.TEST),
    with_info=True, as_supervised=True)
encoder = info.features["text"].encoder
print(encoder.subwords[:10])
['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br']
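
The encoder can also round-trip arbitrary strings, which is a handy way to see what the subword tokens look like (a small sketch using the encoder we just pulled out of the dataset info).

sample_string = "The movie was not what I expected."
encoded = encoder.encode(sample_string)
print(encoded)
print(encoder.decode(encoded))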

Add Padding

padded_shapes = ([None], ())
train_batches = train_data.shuffle(Text.vocabulary_size).padded_batch(
    10, padded_shapes=padded_shapes)
test_batches = test_data.shuffle(Text.vocabulary_size).padded_batch(
    10, padded_shapes=padded_shapes
)

Checkout a Sample

batch, labels = next(iter(train_batches))
print(batch.numpy())
[[  62    9    4 ...    0    0    0]
 [  19 2428    6 ...    0    0    0]
 [ 691    2  594 ... 7961 1457 7975]
 ...
 [6072 5644 8043 ...    0    0    0]
 [ 977   15   57 ...    0    0    0]
 [5646    2    1 ...    0    0    0]]

Build a Model

model = keras.Sequential([
    layers.Embedding(encoder.vocab_size, Text.embeddings_size),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid")
])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 16)          130960    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
=================================================================
Total params: 130,977
Trainable params: 130,977
Non-trainable params: 0
_________________________________________________________________
None

Compile and Train

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
ONCE_PER_EPOCH = 2
with TIMER:
    history = model.fit(train_batches, epochs=10,
                        validation_data=test_batches,
                        verbose=ONCE_PER_EPOCH,
                        validation_steps=20)
2019-09-28 17:14:52,764 graeae.timers.timer start: Started: 2019-09-28 17:14:52.764725
I0928 17:14:52.764965 140515023214400 timer.py:70] Started: 2019-09-28 17:14:52.764725
W0928 17:14:52.806057 140515023214400 deprecation.py:323] From /home/hades/.virtualenvs/In-Too-Deep/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/10
 val_loss: 0.3015 - val_accuracy: 0.8900
2019-09-28 17:17:36,036 graeae.timers.timer end: Ended: 2019-09-28 17:17:36.036090
I0928 17:17:36.036139 140515023214400 timer.py:77] Ended: 2019-09-28 17:17:36.036090
2019-09-28 17:17:36,037 graeae.timers.timer end: Elapsed: 0:02:43.271365
I0928 17:17:36.037808 140515023214400 timer.py:78] Elapsed: 0:02:43.271365

End

data = pandas.DataFrame(history.history)
plot = data.hvplot().opts(title="Training/Validation Performance",
                          width=1000,
                          height=800)
Embed(plot=plot, file_name="training")()

Figure Missing

Amazingly, even with such a simple model, it managed a 92 % validation accuracy.

IMDB GRU With Tokenization

Beginning

This is another version of the RNN model for classifying the IMDB reviews, but this time we're going to tokenize the text ourselves (instead of using the pre-tokenized tensorflow-datasets version) and use a GRU.

Imports

Python

from argparse import Namespace

PyPi

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import hvplot.pandas
import numpy
import pandas
import tensorflow
import tensorflow_datasets

Other

from graeae import Timer, EmbedHoloviews

Set Up

The Timer

TIMER = Timer()

Plotting

Middle

Set Up the Data

imdb, info = tensorflow_datasets.load("imdb_reviews",
                                      with_info=True,
                                      as_supervised=True)
WARNING: Logging before flag parsing goes to stderr.
W0924 21:52:10.158111 139862640383808 dataset_builder.py:439] Warning: Setting shuffle_files=True because split=TRAIN and shuffle_files=None. This behavior will be deprecated on 2019-08-06, at which point shuffle_files=False will be the default for all splits.

training, testing = imdb["train"], imdb["test"]

Building Up the Tokenizer

Since we didn't pass in a specifier for the configuration we wanted (e.g. imdb_reviews/subwords8k), it defaulted to giving us the plain-text reviews (and their labels), so we have to build the tokenizer ourselves.

Split Up the Sentences and Their Labels

As you might recall, the data set consists of 50,000 IMDB movie reviews categorized as positive or negative. To build the tokenizer we first have to split the sentences from their labels.

training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []
with TIMER:
    for sentence, label in training:
        training_sentences.append(str(sentence.numpy()))
        training_labels.append(label.numpy())

    for sentence, label in testing:
        testing_sentences.append(str(sentence.numpy()))
        testing_labels.append(label.numpy())
2019-09-24 21:52:11,396 graeae.timers.timer start: Started: 2019-09-24 21:52:11.395126
I0924 21:52:11.396310 139862640383808 timer.py:70] Started: 2019-09-24 21:52:11.395126
2019-09-24 21:52:18,667 graeae.timers.timer end: Ended: 2019-09-24 21:52:18.667789
I0924 21:52:18.667830 139862640383808 timer.py:77] Ended: 2019-09-24 21:52:18.667789
2019-09-24 21:52:18,670 graeae.timers.timer end: Elapsed: 0:00:07.272663
I0924 21:52:18.670069 139862640383808 timer.py:78] Elapsed: 0:00:07.272663

training_labels_final = numpy.array(training_labels)
testing_labels_final = numpy.array(testing_labels)

Some Constants

Text = Namespace(
    vocab_size = 10000,
    embedding_dim = 16,
    max_length = 120,
    trunc_type='post',
    oov_token = "<OOV>",
)

Build the Tokenizer

tokenizer = Tokenizer(num_words=Text.vocab_size, oov_token=Text.oov_token)
with TIMER:
    tokenizer.fit_on_texts(training_sentences)

    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(training_sentences)
    padded = pad_sequences(sequences, maxlen=Text.max_length, truncating=Text.trunc_type)

    testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
    testing_padded = pad_sequences(testing_sequences, maxlen=Text.max_length)
2019-09-24 21:52:21,705 graeae.timers.timer start: Started: 2019-09-24 21:52:21.705287
I0924 21:52:21.705317 139862640383808 timer.py:70] Started: 2019-09-24 21:52:21.705287
2019-09-24 21:52:32,152 graeae.timers.timer end: Ended: 2019-09-24 21:52:32.152267
I0924 21:52:32.152314 139862640383808 timer.py:77] Ended: 2019-09-24 21:52:32.152267
2019-09-24 21:52:32,154 graeae.timers.timer end: Elapsed: 0:00:10.446980
I0924 21:52:32.154620 139862640383808 timer.py:78] Elapsed: 0:00:10.446980

Decoder Ring

index_to_word = {value: key for key, value in word_index.items()}

def decode_review(text: numpy.array) -> str:
    return " ".join([index_to_word.get(item, "<?>") for item in text])

Build the Model

This time we're going to build a four-layer model with one Bidirectional layer that uses a GRU (Gated Recurrent Unit) instead of an LSTM.

model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(Text.vocab_size, Text.embedding_dim, input_length=Text.max_length),
    tensorflow.keras.layers.Bidirectional(tensorflow.compat.v2.keras.layers.GRU(32)),
    tensorflow.keras.layers.Dense(6, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 120, 16)           160000    
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                9600      
_________________________________________________________________
dense (Dense)                (None, 6)                 390       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
=================================================================
Total params: 169,997
Trainable params: 169,997
Non-trainable params: 0
_________________________________________________________________
None

Train it

EPOCHS = 50
ONCE_PER_EPOCH = 2
batch_size = 8
history = model.fit(padded, training_labels_final,
                    epochs=EPOCHS,
                    batch_size=batch_size,
                    validation_data=(testing_padded, testing_labels_final),
                    verbose=ONCE_PER_EPOCH)

Plot It

data = pandas.DataFrame(history.history)
plot = data.hvplot().opts(title="GRU Training Performance", width=1000, height=800)
Embed(plot=plot, file_name="gru_training")()

He Used Sarcasm

Beginning

This is a look at fitting a model to detect sarcasm using a json blob from Laurence Moroney (https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json).

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path
from pprint import pprint
from urllib.parse import urlparse
import json

PyPi

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import hvplot.pandas
import pandas
import tensorflow

Other

from graeae import (
    CountPercentage,
    EmbedHoloviews,
    TextDownloader,
    Timer
)

Set Up

The Timer

TIMER = Timer()

The Plotting

SLUG = "he-used-sarcasm"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")

The Data

URL = "https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json"
path = Path("~/data/datasets/text/sarcasm/sarcasm.json").expanduser()
downloader = TextDownloader(URL, path)
with TIMER:
    data = json.loads(downloader.download)
2019-09-22 15:05:27,302 graeae.timers.timer start: Started: 2019-09-22 15:05:27.301225
WARNING: Logging before flag parsing goes to stderr.
I0922 15:05:27.302001 139873020925760 timer.py:70] Started: 2019-09-22 15:05:27.301225
2019-09-22 15:05:27,306 TextDownloader download: /home/hades/data/datasets/text/sarcasm/sarcasm.json exists, opening it
I0922 15:05:27.306186 139873020925760 downloader.py:51] /home/hades/data/datasets/text/sarcasm/sarcasm.json exists, opening it
2019-09-22 15:05:27,367 graeae.timers.timer end: Ended: 2019-09-22 15:05:27.367036
I0922 15:05:27.367099 139873020925760 timer.py:77] Ended: 2019-09-22 15:05:27.367036
2019-09-22 15:05:27,369 graeae.timers.timer end: Elapsed: 0:00:00.065811
I0922 15:05:27.369417 139873020925760 timer.py:78] Elapsed: 0:00:00.065811

Middle

Looking At the Data

pprint(data[0])
{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
 'headline': "former versace store clerk sues over secret 'black code' for "
             'minority shoppers',
 'is_sarcastic': 0}

So our data is a dictionary with three keys - the source of the article, the headline of the article, and whether it's a sarcastic headline or not. There's no citation in the original notebook, but it looks like it might be this one on GitHub.

data = pandas.DataFrame(data)
data.loc[:, "site"] = data.article_link.apply(lambda link: urlparse(link).netloc)
CountPercentage(data.site, value_label="Site")()
Site Count Percent (%)
www.huffingtonpost.com 14403 53.93
www.theonion.com 5811 21.76
local.theonion.com 2852 10.68
politics.theonion.com 1767 6.62
entertainment.theonion.com 1194 4.47
www.huffingtonpost.comhttp: 503 1.88
sports.theonion.com 100 0.37
www.huffingtonpost.comhttps: 79 0.30

So, it looks like there are some problems with the URLs. I don't think that's important, but maybe I can clean them up a little anyway.

print(
    data[
        data.site.str.contains("www.huffingtonpost.comhttp"
                               )].article_link.value_counts())
https://www.huffingtonpost.comhttp://nymag.com/daily/intelligencer/2016/05/hillary-clinton-candidacy.html                                                                                              2
https://www.huffingtonpost.comhttps://www.facebook.com/HuffPostQueerVoices/videos/1153919084666530/                                                                                                    1
https://www.huffingtonpost.comhttp://live.huffingtonpost.com/r/segment/chris-meloni-star-underground/56d8584f99ec6dca3d00000a                                                                          1
https://www.huffingtonpost.comhttp://www.thestreet.com/story/13223501/1/post-retirement-work-may-not-save-your-golden-years.html                                                                       1
https://www.huffingtonpost.comhttps://www.facebook.com/HuffPostEntertainment/                                                                                                                          1
                                                                                                                                                                                                      ..
https://www.huffingtonpost.comhttp://nymag.com/thecut/2015/10/first-legal-abortionists-tell-their-stories.html?mid=twitter_nymag                                                                       1
https://www.huffingtonpost.comhttp://www.tampabay.com/blogs/the-buzz-florida-politics/marco-rubio-warming-up-to-donald-trump/2275308                                                                   1
https://www.huffingtonpost.comhttp://live.huffingtonpost.com/r/segment/porn-to-pay-for-college/55aeadf62b8c2a2f6f000193                                                                                1
https://www.huffingtonpost.comhttps://www.thedodo.com/dog-mouth-taped-shut-facebook-1481874724.html                                                                                                    1
https://www.huffingtonpost.comhttps://www.washingtonpost.com/politics/ben-carson-to-tell-supporters-he-sees-no-path-forward-for-campaign/2016/03/02/d6bef352-d9b3-11e5-891a-4ed04f4213e8_story.html    1
Name: article_link, Length: 581, dtype: int64

That's kind of odd; I don't know what it means - maybe the Huffington Post was citing other sites? I went to check the GitHub dataset I mentioned, but it's actually much larger than this one, so I don't know whether it's really the source or not.

prefixes = ("www.huffingtonpost.comhttp:", "www.huffingtonpost.comhttps:")
for prefix in prefixes:
    data.loc[:, "site"] = data.site.str.replace(
        prefix,
        "www.huffingtonpost.com")

prefixes = ("local.theonion.com",
            "politics.theonion.com",
            "entertainment.theonion.com",
            "sports.theonion.com")

for prefix in prefixes:
    data.loc[:, "site"] = data.site.str.replace(prefix,
                                                "www.theonion.com")
counter = CountPercentage(data.site, value_label="Site")
counter()
Site Count Percent (%)
www.huffingtonpost.com 14985 56.10
www.theonion.com 11724 43.90
plot = counter.table.hvplot.bar(x="Site", y="Count").opts(
    title="Distribution by Site",
    width=1000,
    height=800)
Embed(plot=plot, file_name="site_distribution")()

Figure Missing

counter = CountPercentage(data.is_sarcastic, value_label="Is Sarcastic")
counter()
Is Sarcastic Count Percent (%)
0 14,985 56.10
1 11,724 43.90

Given that the counts match, I'm assuming anything from the Huffington Post is labeled as not sarcastic and anything from The Onion is labeled as sarcastic.

assert all(data[data.site=="www.theonion.com"].is_sarcastic)
assert not any(data[data.site=="www.huffingtonpost.com"].is_sarcastic)

Set Up the Tokenizing and Training Data

print(f"{len(data):,}")
26,709

Text = Namespace(
    vocabulary_size = 1000,
    embedding_dim = 16,
    max_length = 120,
    truncating_type='post',
    padding_type='post',
    out_of_vocabulary_tok = "<OOV>",
)

# this is actually the default for train_test_split
Training = Namespace(
    size = 0.75,
    epochs = 50,
    verbosity = 2,
    )

The Training and Testing Data

x_train, x_test, y_train, y_test = train_test_split(
    data.headline, data.is_sarcastic, train_size=Training.size,
)

The Tokenizer

tokenizer = Tokenizer(num_words=Text.vocabulary_size,
                      oov_token=Text.out_of_vocabulary_tok)
print(tokenizer.__doc__)
Text tokenization utility class.

    This class allows to vectorize a text corpus, by turning each
    text into either a sequence of integers (each integer being the index
    of a token in a dictionary) or into a vector where the coefficient
    for each token could be binary, based on word count, based on tf-idf...

    # Arguments
        num_words: the maximum number of words to keep, based
            on word frequency. Only the most common `num_words-1` words will
            be kept.
        filters: a string where each element is a character that will be
            filtered from the texts. The default is all punctuation, plus
            tabs and line breaks, minus the `'` character.
        lower: boolean. Whether to convert the texts to lowercase.
        split: str. Separator for word splitting.
        char_level: if True, every character will be treated as a token.
        oov_token: if given, it will be added to word_index and used to
            replace out-of-vocabulary words during text_to_sequence calls

    By default, all punctuation is removed, turning the texts into
    space-separated sequences of words
    (words maybe include the `'` character). These sequences are then
    split into lists of tokens. They will then be indexed or vectorized.

    `0` is a reserved index that won't be assigned to any word.

Now that we have a tokenizer we can tokenize our training headlines.

help(tokenizer.fit_on_texts)
Help on method fit_on_texts in module keras_preprocessing.text:

fit_on_texts(texts) method of keras_preprocessing.text.Tokenizer instance
    Updates internal vocabulary based on a list of texts.
    
    In the case where texts contains lists,
    we assume each entry of the lists to be a token.
    
    Required before using `texts_to_sequences` or `texts_to_matrix`.
    
    # Arguments
        texts: can be a list of strings,
            a generator of strings (for memory-efficiency),
            or a list of list of strings.

tokenizer.fit_on_texts(x_train)

Now that we've fit the headlines we can get the word index, a dict mapping words to their index.

word_index = tokenizer.word_index

Note that the tokenizer doesn't remove stop-words.

print("the" in word_index)
True

Now we'll convert the training headlines to sequences of numbers.

help(tokenizer.texts_to_sequences)
Help on method texts_to_sequences in module keras_preprocessing.text:

texts_to_sequences(texts) method of keras_preprocessing.text.Tokenizer instance
    Transforms each text in texts to a sequence of integers.
    
    Only top `num_words-1` most frequent words will be taken into account.
    Only words known by the tokenizer will be taken into account.
    
    # Arguments
        texts: A list of texts (strings).
    
    # Returns
        A list of sequences.

We're also going to have to pad them to make them the same length.

help(pad_sequences)
Help on function pad_sequences in module keras_preprocessing.sequence:

pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)
    Pads sequences to the same length.
    
    This function transforms a list of
    `num_samples` sequences (lists of integers)
    into a 2D Numpy array of shape `(num_samples, num_timesteps)`.
    `num_timesteps` is either the `maxlen` argument if provided,
    or the length of the longest sequence otherwise.
    
    Sequences that are shorter than `num_timesteps`
    are padded with `value` at the end.
    
    Sequences longer than `num_timesteps` are truncated
    so that they fit the desired length.
    The position where padding or truncation happens is determined by
    the arguments `padding` and `truncating`, respectively.
    
    Pre-padding is the default.
    
    # Arguments
        sequences: List of lists, where each element is a sequence.
        maxlen: Int, maximum length of all sequences.
        dtype: Type of the output sequences.
            To pad sequences with variable length strings, you can use `object`.
        padding: String, 'pre' or 'post':
            pad either before or after each sequence.
        truncating: String, 'pre' or 'post':
            remove values from sequences larger than
            `maxlen`, either at the beginning or at the end of the sequences.
        value: Float or String, padding value.
    
    # Returns
        x: Numpy array with shape `(len(sequences), maxlen)`
    
    # Raises
        ValueError: In case of invalid values for `truncating` or `padding`,
            or in case of invalid shape for a `sequences` entry.

training_sequences = tokenizer.texts_to_sequences(x_train)
training_padded = pad_sequences(training_sequences, maxlen=Text.max_length, padding=Text.padding_type, truncating=Text.truncating_type)

testing_sequences = tokenizer.texts_to_sequences(x_test)
testing_padded = pad_sequences(testing_sequences, maxlen=Text.max_length, padding=Text.padding_type, truncating=Text.truncating_type)
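
As with the earlier posts, it can help to spot-check one encoded headline against the original (a sketch that builds the reverse lookup by hand rather than assuming anything beyond the tokenizer we already fit).

# index 0 is pad_sequences' padding value and isn't in word_index
index_to_word = {index: word for word, index in tokenizer.word_index.items()}
print(x_train.iloc[0])
print(" ".join(index_to_word.get(index, "<PAD>")
               for index in training_padded[0]))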

Build The Model

We're going to use a convolutional neural network to try and classify our headlines as sarcastic or not-sarcastic.

It's a Sequence of Layers

print(tensorflow.keras.Sequential.__doc__)
Linear stack of layers.

  Arguments:
      layers: list of layers to add to the model.

  Example:

  ```python
  # Optionally, the first layer can receive an `input_shape` argument:
  model = Sequential()
  model.add(Dense(32, input_shape=(500,)))
  # Afterwards, we do automatic shape inference:
  model.add(Dense(32))

  # This is identical to the following:
  model = Sequential()
  model.add(Dense(32, input_dim=500))

  # And to the following:
  model = Sequential()
  model.add(Dense(32, batch_input_shape=(None, 500)))

  # Note that you can also omit the `input_shape` argument:
  # In that case the model gets built the first time you call `fit` (or other
  # training and evaluation methods).
  model = Sequential()
  model.add(Dense(32))
  model.add(Dense(32))
  model.compile(optimizer=optimizer, loss=loss)
  # This builds the model for the first time:
  model.fit(x, y, batch_size=32, epochs=10)

  # Note that when using this delayed-build pattern (no input shape specified),
  # the model doesn't have any weights until the first call
  # to a training/evaluation method (since it isn't yet built):
  model = Sequential()
  model.add(Dense(32))
  model.add(Dense(32))
  model.weights  # returns []

  # Whereas if you specify the input shape, the model gets built continuously
  # as you are adding layers:
  model = Sequential()
  model.add(Dense(32, input_shape=(500,)))
  model.add(Dense(32))
  model.weights  # returns list of length 4

  # When using the delayed-build pattern (no input shape specified), you can
  # choose to manually build your model by calling `build(batch_input_shape)`:
  model = Sequential()
  model.add(Dense(32))
  model.add(Dense(32))
  model.build((None, 500))
  model.weights  # returns list of length 4
  ```
  

Start With An Embedding Layer

print(tensorflow.keras.layers.Embedding.__doc__)
Turns positive integers (indexes) into dense vectors of fixed size.

  e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]`

  This layer can only be used as the first layer in a model.

  Example:

  ```python
  model = Sequential()
  model.add(Embedding(1000, 64, input_length=10))
  # the model will take as input an integer matrix of size (batch,
  # input_length).
  # the largest integer (i.e. word index) in the input should be no larger
  # than 999 (vocabulary size).
  # now model.output_shape == (None, 10, 64), where None is the batch
  # dimension.

  input_array = np.random.randint(1000, size=(32, 10))

  model.compile('rmsprop', 'mse')
  output_array = model.predict(input_array)
  assert output_array.shape == (32, 10, 64)
  ```

  Arguments:
    input_dim: int > 0. Size of the vocabulary,
      i.e. maximum integer index + 1.
    output_dim: int >= 0. Dimension of the dense embedding.
    embeddings_initializer: Initializer for the `embeddings` matrix.
    embeddings_regularizer: Regularizer function applied to
      the `embeddings` matrix.
    embeddings_constraint: Constraint function applied to
      the `embeddings` matrix.
    mask_zero: Whether or not the input value 0 is a special "padding"
      value that should be masked out.
      This is useful when using recurrent layers
      which may take variable length input.
      If this is `True` then all subsequent layers
      in the model need to support masking or an exception will be raised.
      If mask_zero is set to True, as a consequence, index 0 cannot be
      used in the vocabulary (input_dim should equal size of
      vocabulary + 1).
    input_length: Length of input sequences, when it is constant.
      This argument is required if you are going to connect
      `Flatten` then `Dense` layers upstream
      (without it, the shape of the dense outputs cannot be computed).

  Input shape:
    2D tensor with shape: `(batch_size, input_length)`.

  Output shape:
    3D tensor with shape: `(batch_size, input_length, output_dim)`.
  

The Convolutional Layer

print(tensorflow.keras.layers.Conv1D.__doc__)
1D convolution layer (e.g. temporal convolution).

  This layer creates a convolution kernel that is convolved
  with the layer input over a single spatial (or temporal) dimension
  to produce a tensor of outputs.
  If `use_bias` is True, a bias vector is created and added to the outputs.
  Finally, if `activation` is not `None`,
  it is applied to the outputs as well.

  When using this layer as the first layer in a model,
  provide an `input_shape` argument
  (tuple of integers or `None`, e.g.
  `(10, 128)` for sequences of 10 vectors of 128-dimensional vectors,
  or `(None, 128)` for variable-length sequences of 128-dimensional vectors.

  Arguments:
    filters: Integer, the dimensionality of the output space
      (i.e. the number of output filters in the convolution).
    kernel_size: An integer or tuple/list of a single integer,
      specifying the length of the 1D convolution window.
    strides: An integer or tuple/list of a single integer,
      specifying the stride length of the convolution.
      Specifying any stride value != 1 is incompatible with specifying
      any `dilation_rate` value != 1.
    padding: One of `"valid"`, `"causal"` or `"same"` (case-insensitive).
      `"causal"` results in causal (dilated) convolutions, e.g. output[t]
      does not depend on input[t+1:]. Useful when modeling temporal data
      where the model should not violate the temporal order.
      See [WaveNet: A Generative Model for Raw Audio, section
        2.1](https://arxiv.org/abs/1609.03499).
    data_format: A string,
      one of `channels_last` (default) or `channels_first`.
    dilation_rate: an integer or tuple/list of a single integer, specifying
      the dilation rate to use for dilated convolution.
      Currently, specifying any `dilation_rate` value != 1 is
      incompatible with specifying any `strides` value != 1.
    activation: Activation function to use.
      If you don't specify anything, no activation is applied
      (ie. "linear" activation: `a(x) = x`).
    use_bias: Boolean, whether the layer uses a bias vector.
    kernel_initializer: Initializer for the `kernel` weights matrix.
    bias_initializer: Initializer for the bias vector.
    kernel_regularizer: Regularizer function applied to
      the `kernel` weights matrix.
    bias_regularizer: Regularizer function applied to the bias vector.
    activity_regularizer: Regularizer function applied to
      the output of the layer (its "activation")..
    kernel_constraint: Constraint function applied to the kernel matrix.
    bias_constraint: Constraint function applied to the bias vector.

  Examples:
    ```python
    # Small convolutional model for 128-length vectors with 6 timesteps
    # model.input_shape == (None, 6, 128)
    
    model = Sequential()
    model.add(Conv1D(32, 3, 
              activation='relu', 
              input_shape=(6, 128)))
    
    # now: model.output_shape == (None, 4, 32)
    ```

  Input shape:
    3D tensor with shape: `(batch_size, steps, input_dim)`

  Output shape:
    3D tensor with shape: `(batch_size, new_steps, filters)`
      `steps` value might have changed due to padding or strides.
  

A Pooling Layer

print(tensorflow.keras.layers.GlobalMaxPooling1D.__doc__)
Global max pooling operation for temporal data.

  Arguments:
    data_format: A string,
      one of `channels_last` (default) or `channels_first`.
      The ordering of the dimensions in the inputs.
      `channels_last` corresponds to inputs with shape
      `(batch, steps, features)` while `channels_first`
      corresponds to inputs with shape
      `(batch, features, steps)`.

  Input shape:
    - If `data_format='channels_last'`:
      3D tensor with shape:
      `(batch_size, steps, features)`
    - If `data_format='channels_first'`:
      3D tensor with shape:
      `(batch_size, features, steps)`

  Output shape:
    2D tensor with shape `(batch_size, features)`.
  

The Fully-Connected Layers

Finally our output layers.

print(tensorflow.keras.layers.Dense.__doc__)
Just your regular densely-connected NN layer.

  `Dense` implements the operation:
  `output = activation(dot(input, kernel) + bias)`
  where `activation` is the element-wise activation function
  passed as the `activation` argument, `kernel` is a weights matrix
  created by the layer, and `bias` is a bias vector created by the layer
  (only applicable if `use_bias` is `True`).

  Note: If the input to the layer has a rank greater than 2, then
  it is flattened prior to the initial dot product with `kernel`.

  Example:

  ```python
  # as first layer in a sequential model:
  model = Sequential()
  model.add(Dense(32, input_shape=(16,)))
  # now the model will take as input arrays of shape (*, 16)
  # and output arrays of shape (*, 32)

  # after the first layer, you don't need to specify
  # the size of the input anymore:
  model.add(Dense(32))
  ```

  Arguments:
    units: Positive integer, dimensionality of the output space.
    activation: Activation function to use.
      If you don't specify anything, no activation is applied
      (ie. "linear" activation: `a(x) = x`).
    use_bias: Boolean, whether the layer uses a bias vector.
    kernel_initializer: Initializer for the `kernel` weights matrix.
    bias_initializer: Initializer for the bias vector.
    kernel_regularizer: Regularizer function applied to
      the `kernel` weights matrix.
    bias_regularizer: Regularizer function applied to the bias vector.
    activity_regularizer: Regularizer function applied to
      the output of the layer (its "activation")..
    kernel_constraint: Constraint function applied to
      the `kernel` weights matrix.
    bias_constraint: Constraint function applied to the bias vector.

  Input shape:
    N-D tensor with shape: `(batch_size, ..., input_dim)`.
    The most common situation would be
    a 2D input with shape `(batch_size, input_dim)`.

  Output shape:
    N-D tensor with shape: `(batch_size, ..., units)`.
    For instance, for a 2D input with shape `(batch_size, input_dim)`,
    the output would have shape `(batch_size, units)`.
  

Build It

I originally added the layers using the model.add method, but when I tried to train it the output said the layers didn't have gradients and the model never improved… I'll have to look into that, but in the meantime, passing the layers to the constructor seems to work.

model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(
        input_dim=Text.vocabulary_size,
        output_dim=Text.embedding_dim,
        input_length=Text.max_length),
    tensorflow.keras.layers.Conv1D(filters=128,
                                   kernel_size=5,
                                   activation='relu'),
    tensorflow.keras.layers.GlobalMaxPooling1D(),
    tensorflow.keras.layers.Dense(24, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])

Compile It

print(model.compile.__doc__)
Configures the model for training.

    Arguments:
        optimizer: String (name of optimizer) or optimizer instance.
            See `tf.keras.optimizers`.
        loss: String (name of objective function), objective function or
            `tf.losses.Loss` instance. See `tf.losses`. If the model has
            multiple outputs, you can use a different loss on each output by
            passing a dictionary or a list of losses. The loss value that will
            be minimized by the model will then be the sum of all individual
            losses.
        metrics: List of metrics to be evaluated by the model during training
            and testing. Typically you will use `metrics=['accuracy']`.
            To specify different metrics for different outputs of a
            multi-output model, you could also pass a dictionary, such as
            `metrics={'output_a': 'accuracy', 'output_b': ['accuracy', 'mse']}`.
            You can also pass a list (len = len(outputs)) of lists of metrics
            such as `metrics=[['accuracy'], ['accuracy', 'mse']]` or
            `metrics=['accuracy', ['accuracy', 'mse']]`.
        loss_weights: Optional list or dictionary specifying scalar
            coefficients (Python floats) to weight the loss contributions
            of different model outputs.
            The loss value that will be minimized by the model
            will then be the *weighted sum* of all individual losses,
            weighted by the `loss_weights` coefficients.
            If a list, it is expected to have a 1:1 mapping
            to the model's outputs. If a tensor, it is expected to map
            output names (strings) to scalar coefficients.
        sample_weight_mode: If you need to do timestep-wise
            sample weighting (2D weights), set this to `"temporal"`.
            `None` defaults to sample-wise weights (1D).
            If the model has multiple outputs, you can use a different
            `sample_weight_mode` on each output by passing a
            dictionary or a list of modes.
        weighted_metrics: List of metrics to be evaluated and weighted
            by sample_weight or class_weight during training and testing.
        target_tensors: By default, Keras will create placeholders for the
            model's target, which will be fed with the target data during
            training. If instead you would like to use your own
            target tensors (in turn, Keras will not expect external
            Numpy data for these targets at training time), you
            can specify them via the `target_tensors` argument. It can be
            a single tensor (for a single-output model), a list of tensors,
            or a dict mapping output names to target tensors.
        distribute: NOT SUPPORTED IN TF 2.0, please create and compile the
            model under distribution strategy scope instead of passing it to
            compile.
        **kwargs: Any additional arguments.

    Raises:
        ValueError: In case of invalid arguments for
            `optimizer`, `loss`, `metrics` or `sample_weight_mode`.
    
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 120, 16)           16000     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 116, 128)          10368     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 24)                3096      
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 25        
=================================================================
Total params: 29,489
Trainable params: 29,489
Non-trainable params: 0
_________________________________________________________________
None

Train It

print(model.fit.__doc__)
Trains the model for a fixed number of epochs (iterations on a dataset).

    Arguments:
        x: Input data. It could be:
          - A Numpy array (or array-like), or a list of arrays
            (in case the model has multiple inputs).
          - A TensorFlow tensor, or a list of tensors
            (in case the model has multiple inputs).
          - A dict mapping input names to the corresponding array/tensors,
            if the model has named inputs.
          - A `tf.data` dataset. Should return a tuple
            of either `(inputs, targets)` or
            `(inputs, targets, sample_weights)`.
          - A generator or `keras.utils.Sequence` returning `(inputs, targets)`
            or `(inputs, targets, sample weights)`.
        y: Target data. Like the input data `x`,
          it could be either Numpy array(s) or TensorFlow tensor(s).
          It should be consistent with `x` (you cannot have Numpy inputs and
          tensor targets, or inversely). If `x` is a dataset, generator,
          or `keras.utils.Sequence` instance, `y` should
          not be specified (since targets will be obtained from `x`).
        batch_size: Integer or `None`.
            Number of samples per gradient update.
            If unspecified, `batch_size` will default to 32.
            Do not specify the `batch_size` if your data is in the
            form of symbolic tensors, datasets,
            generators, or `keras.utils.Sequence` instances (since they generate
            batches).
        epochs: Integer. Number of epochs to train the model.
            An epoch is an iteration over the entire `x` and `y`
            data provided.
            Note that in conjunction with `initial_epoch`,
            `epochs` is to be understood as "final epoch".
            The model is not trained for a number of iterations
            given by `epochs`, but merely until the epoch
            of index `epochs` is reached.
        verbose: 0, 1, or 2. Verbosity mode.
            0 = silent, 1 = progress bar, 2 = one line per epoch.
            Note that the progress bar is not particularly useful when
            logged to a file, so verbose=2 is recommended when not running
            interactively (eg, in a production environment).
        callbacks: List of `keras.callbacks.Callback` instances.
            List of callbacks to apply during training.
            See `tf.keras.callbacks`.
        validation_split: Float between 0 and 1.
            Fraction of the training data to be used as validation data.
            The model will set apart this fraction of the training data,
            will not train on it, and will evaluate
            the loss and any model metrics
            on this data at the end of each epoch.
            The validation data is selected from the last samples
            in the `x` and `y` data provided, before shuffling. This argument is
            not supported when `x` is a dataset, generator or
           `keras.utils.Sequence` instance.
        validation_data: Data on which to evaluate
            the loss and any model metrics at the end of each epoch.
            The model will not be trained on this data.
            `validation_data` will override `validation_split`.
            `validation_data` could be:
              - tuple `(x_val, y_val)` of Numpy arrays or tensors
              - tuple `(x_val, y_val, val_sample_weights)` of Numpy arrays
              - dataset
            For the first two cases, `batch_size` must be provided.
            For the last case, `validation_steps` must be provided.
        shuffle: Boolean (whether to shuffle the training data
            before each epoch) or str (for 'batch').
            'batch' is a special option for dealing with the
            limitations of HDF5 data; it shuffles in batch-sized chunks.
            Has no effect when `steps_per_epoch` is not `None`.
        class_weight: Optional dictionary mapping class indices (integers)
            to a weight (float) value, used for weighting the loss function
            (during training only).
            This can be useful to tell the model to
            "pay more attention" to samples from
            an under-represented class.
        sample_weight: Optional Numpy array of weights for
            the training samples, used for weighting the loss function
            (during training only). You can either pass a flat (1D)
            Numpy array with the same length as the input samples
            (1:1 mapping between weights and samples),
            or in the case of temporal data,
            you can pass a 2D array with shape
            `(samples, sequence_length)`,
            to apply a different weight to every timestep of every sample.
            In this case you should make sure to specify
            `sample_weight_mode="temporal"` in `compile()`. This argument is not
            supported when `x` is a dataset, generator, or
           `keras.utils.Sequence` instance, instead provide the sample_weights
            as the third element of `x`.
        initial_epoch: Integer.
            Epoch at which to start training
            (useful for resuming a previous training run).
        steps_per_epoch: Integer or `None`.
            Total number of steps (batches of samples)
            before declaring one epoch finished and starting the
            next epoch. When training with input tensors such as
            TensorFlow data tensors, the default `None` is equal to
            the number of samples in your dataset divided by
            the batch size, or 1 if that cannot be determined. If x is a
            `tf.data` dataset, and 'steps_per_epoch'
            is None, the epoch will run until the input dataset is exhausted.
            This argument is not supported with array inputs.
        validation_steps: Only relevant if `validation_data` is provided and
            is a `tf.data` dataset. Total number of steps (batches of
            samples) to draw before stopping when performing validation
            at the end of every epoch. If validation_data is a `tf.data` dataset
            and 'validation_steps' is None, validation
            will run until the `validation_data` dataset is exhausted.
        validation_freq: Only relevant if validation data is provided. Integer
            or `collections_abc.Container` instance (e.g. list, tuple, etc.).
            If an integer, specifies how many training epochs to run before a
            new validation run is performed, e.g. `validation_freq=2` runs
            validation every 2 epochs. If a Container, specifies the epochs on
            which to run validation, e.g. `validation_freq=[1, 2, 10]` runs
            validation at the end of the 1st, 2nd, and 10th epochs.
        max_queue_size: Integer. Used for generator or `keras.utils.Sequence`
            input only. Maximum size for the generator queue.
            If unspecified, `max_queue_size` will default to 10.
        workers: Integer. Used for generator or `keras.utils.Sequence` input
            only. Maximum number of processes to spin up
            when using process-based threading. If unspecified, `workers`
            will default to 1. If 0, will execute the generator on the main
            thread.
        use_multiprocessing: Boolean. Used for generator or
            `keras.utils.Sequence` input only. If `True`, use process-based
            threading. If unspecified, `use_multiprocessing` will default to
            `False`. Note that because this implementation relies on
            multiprocessing, you should not pass non-picklable arguments to
            the generator as they can't be passed easily to children processes.
        **kwargs: Used for backwards compatibility.

    Returns:
        A `History` object. Its `History.history` attribute is
        a record of training loss values and metrics values
        at successive epochs, as well as validation loss values
        and validation metrics values (if applicable).

    Raises:
        RuntimeError: If the model was never compiled.
        ValueError: In case of mismatch between the provided input data
            and what the model expects.
    
with TIMER:
    history = model.fit(training_padded,
                        y_train.values,
                        epochs=Training.epochs,
                        validation_data=(testing_padded, y_test.values),
                        verbose=Training.verbosity)
2019-09-22 16:30:20,369 graeae.timers.timer start: Started: 2019-09-22 16:30:20.369886
I0922 16:30:20.369913 139873020925760 timer.py:70] Started: 2019-09-22 16:30:20.369886
Train on 20031 samples, validate on 6678 samples
Epoch 1/50
20031/20031 - 5s - loss: 0.4741 - accuracy: 0.7623 - val_loss: 0.4033 - val_accuracy: 0.8146
Epoch 2/50
20031/20031 - 4s - loss: 0.3663 - accuracy: 0.8366 - val_loss: 0.3980 - val_accuracy: 0.8196
Epoch 3/50
20031/20031 - 4s - loss: 0.3306 - accuracy: 0.8554 - val_loss: 0.3909 - val_accuracy: 0.8240
Epoch 4/50
20031/20031 - 4s - loss: 0.2990 - accuracy: 0.8721 - val_loss: 0.4148 - val_accuracy: 0.8179
Epoch 5/50
20031/20031 - 4s - loss: 0.2697 - accuracy: 0.8867 - val_loss: 0.4050 - val_accuracy: 0.8282
Epoch 6/50
20031/20031 - 4s - loss: 0.2406 - accuracy: 0.9003 - val_loss: 0.4291 - val_accuracy: 0.8212
Epoch 7/50
20031/20031 - 4s - loss: 0.2080 - accuracy: 0.9165 - val_loss: 0.4650 - val_accuracy: 0.8181
Epoch 8/50
20031/20031 - 4s - loss: 0.1824 - accuracy: 0.9272 - val_loss: 0.5053 - val_accuracy: 0.8130
Epoch 9/50
20031/20031 - 4s - loss: 0.1559 - accuracy: 0.9393 - val_loss: 0.5389 - val_accuracy: 0.8065
Epoch 10/50
20031/20031 - 4s - loss: 0.1325 - accuracy: 0.9498 - val_loss: 0.6213 - val_accuracy: 0.8044
Epoch 11/50
20031/20031 - 4s - loss: 0.1104 - accuracy: 0.9599 - val_loss: 0.6902 - val_accuracy: 0.8034
Epoch 12/50
20031/20031 - 4s - loss: 0.0966 - accuracy: 0.9646 - val_loss: 0.7437 - val_accuracy: 0.8035
Epoch 13/50
20031/20031 - 4s - loss: 0.0848 - accuracy: 0.9689 - val_loss: 0.8285 - val_accuracy: 0.7954
Epoch 14/50
20031/20031 - 4s - loss: 0.0693 - accuracy: 0.9753 - val_loss: 0.9121 - val_accuracy: 0.7934
Epoch 15/50
20031/20031 - 4s - loss: 0.0608 - accuracy: 0.9777 - val_loss: 1.0783 - val_accuracy: 0.7931
Epoch 16/50
20031/20031 - 4s - loss: 0.0529 - accuracy: 0.9810 - val_loss: 1.0620 - val_accuracy: 0.7889
Epoch 17/50
20031/20031 - 4s - loss: 0.0506 - accuracy: 0.9819 - val_loss: 1.2497 - val_accuracy: 0.7889
Epoch 18/50
20031/20031 - 4s - loss: 0.0471 - accuracy: 0.9821 - val_loss: 1.2518 - val_accuracy: 0.7963
Epoch 19/50
20031/20031 - 4s - loss: 0.0457 - accuracy: 0.9819 - val_loss: 1.3492 - val_accuracy: 0.7917
Epoch 20/50
20031/20031 - 4s - loss: 0.0392 - accuracy: 0.9851 - val_loss: 1.3702 - val_accuracy: 0.7948
Epoch 21/50
20031/20031 - 4s - loss: 0.0357 - accuracy: 0.9860 - val_loss: 1.4300 - val_accuracy: 0.7948
Epoch 22/50
20031/20031 - 4s - loss: 0.0341 - accuracy: 0.9864 - val_loss: 1.5654 - val_accuracy: 0.7889
Epoch 23/50
20031/20031 - 4s - loss: 0.0360 - accuracy: 0.9860 - val_loss: 1.5615 - val_accuracy: 0.7951
Epoch 24/50
20031/20031 - 4s - loss: 0.0307 - accuracy: 0.9872 - val_loss: 1.6964 - val_accuracy: 0.7953
Epoch 25/50
20031/20031 - 4s - loss: 0.0283 - accuracy: 0.9893 - val_loss: 1.6917 - val_accuracy: 0.7920
Epoch 26/50
20031/20031 - 4s - loss: 0.0365 - accuracy: 0.9850 - val_loss: 1.6935 - val_accuracy: 0.7944
Epoch 27/50
20031/20031 - 4s - loss: 0.0342 - accuracy: 0.9851 - val_loss: 1.7912 - val_accuracy: 0.7853
Epoch 28/50
20031/20031 - 4s - loss: 0.0301 - accuracy: 0.9879 - val_loss: 1.8194 - val_accuracy: 0.7887
Epoch 29/50
20031/20031 - 4s - loss: 0.0254 - accuracy: 0.9887 - val_loss: 1.9231 - val_accuracy: 0.7922
Epoch 30/50
20031/20031 - 4s - loss: 0.0216 - accuracy: 0.9910 - val_loss: 1.9480 - val_accuracy: 0.7914
Epoch 31/50
20031/20031 - 4s - loss: 0.0243 - accuracy: 0.9895 - val_loss: 1.9487 - val_accuracy: 0.7847
Epoch 32/50
20031/20031 - 4s - loss: 0.0241 - accuracy: 0.9891 - val_loss: 2.0333 - val_accuracy: 0.7893
Epoch 33/50
20031/20031 - 4s - loss: 0.0334 - accuracy: 0.9863 - val_loss: 1.9498 - val_accuracy: 0.7937
Epoch 34/50
20031/20031 - 4s - loss: 0.0318 - accuracy: 0.9873 - val_loss: 2.0181 - val_accuracy: 0.7942
Epoch 35/50
20031/20031 - 4s - loss: 0.0273 - accuracy: 0.9882 - val_loss: 2.0254 - val_accuracy: 0.7913
Epoch 36/50
20031/20031 - 4s - loss: 0.0236 - accuracy: 0.9897 - val_loss: 2.1159 - val_accuracy: 0.7937
Epoch 37/50
20031/20031 - 4s - loss: 0.0204 - accuracy: 0.9905 - val_loss: 2.1018 - val_accuracy: 0.7950
Epoch 38/50
20031/20031 - 4s - loss: 0.0187 - accuracy: 0.9916 - val_loss: 2.1939 - val_accuracy: 0.7947
Epoch 39/50
20031/20031 - 4s - loss: 0.0253 - accuracy: 0.9888 - val_loss: 2.2090 - val_accuracy: 0.7920
Epoch 40/50
20031/20031 - 4s - loss: 0.0270 - accuracy: 0.9889 - val_loss: 2.2737 - val_accuracy: 0.7862
Epoch 41/50
20031/20031 - 4s - loss: 0.0234 - accuracy: 0.9893 - val_loss: 2.2559 - val_accuracy: 0.7926
Epoch 42/50
20031/20031 - 4s - loss: 0.0223 - accuracy: 0.9902 - val_loss: 2.3223 - val_accuracy: 0.7884
Epoch 43/50
20031/20031 - 4s - loss: 0.0251 - accuracy: 0.9897 - val_loss: 2.2547 - val_accuracy: 0.7863
Epoch 44/50
20031/20031 - 4s - loss: 0.0209 - accuracy: 0.9900 - val_loss: 2.3917 - val_accuracy: 0.7823
Epoch 45/50
20031/20031 - 4s - loss: 0.0245 - accuracy: 0.9889 - val_loss: 2.4222 - val_accuracy: 0.7881
Epoch 46/50
20031/20031 - 4s - loss: 0.0215 - accuracy: 0.9901 - val_loss: 2.4135 - val_accuracy: 0.7869
Epoch 47/50
20031/20031 - 4s - loss: 0.0229 - accuracy: 0.9896 - val_loss: 2.3287 - val_accuracy: 0.7823
Epoch 48/50
20031/20031 - 4s - loss: 0.0191 - accuracy: 0.9918 - val_loss: 2.4639 - val_accuracy: 0.7845
Epoch 49/50
20031/20031 - 4s - loss: 0.0183 - accuracy: 0.9911 - val_loss: 2.6068 - val_accuracy: 0.7811
Epoch 50/50
20031/20031 - 4s - loss: 0.0229 - accuracy: 0.9897 - val_loss: 2.5152 - val_accuracy: 0.7928
2019-09-22 16:33:41,089 graeae.timers.timer end: Ended: 2019-09-22 16:33:41.089405
I0922 16:33:41.089459 139873020925760 timer.py:77] Ended: 2019-09-22 16:33:41.089405
2019-09-22 16:33:41,091 graeae.timers.timer end: Elapsed: 0:03:20.719519
I0922 16:33:41.091247 139873020925760 timer.py:78] Elapsed: 0:03:20.719519

Once again it looks like the model is overfitting; I should add a checkpoint or an early-stopping callback.
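A minimal sketch of what that might look like, assuming tensorflow is imported as in the other sections (the checkpoint path, monitor, and patience here are arbitrary choices, not part of the run above):

# keep the best weights seen so far and stop once the validation loss
# hasn't improved for a few epochs
checkpoint = tensorflow.keras.callbacks.ModelCheckpoint(
    "best-cnn-sarcasm.hdf5", monitor="val_loss", save_best_only=True)
early_stopping = tensorflow.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

history = model.fit(training_padded,
                    y_train.values,
                    epochs=Training.epochs,
                    validation_data=(testing_padded, y_test.values),
                    callbacks=[checkpoint, early_stopping],
                    verbose=Training.verbosity)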

Plot the Performance

performance = pandas.DataFrame(history.history)
plot = performance.hvplot().opts(title="CNN Sarcasm Training Performance",
                                 width=1000,
                                 height=800)
Embed(plot=plot, file_name="cnn_training")()

Figure Missing

There's something very wrong with the validation. I'll have to look into that.


End

Multi-Layer LSTM

Beginning

Imports

Python

from functools import partial
from pathlib import Path
import pickle

PyPi

import holoviews
import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets

Others

from graeae import Timer, EmbedHoloviews

Set Up

The Timer

TIMER = Timer()

Plotting

Embed = partial(EmbedHoloviews,
                folder_path="../../files/posts/keras/multi-layer-lstm/")

The Dataset

This once again uses the IMDB dataset with 50,000 reviews. It has already been converted from strings to integers - each token is encoded as its own integer. Adding with_info=True returns a DatasetInfo object that contains the encoder with the token-to-integer mapping. Passing in imdb_reviews/subwords8k limits the vocabulary to roughly 8,000 sub-word tokens.

Note: The first time you run this it will download a fairly large dataset so it might appear to hang, but after the first time it is fairly quick.

dataset, info = tensorflow_datasets.load("imdb_reviews/subwords8k",
                                         with_info=True,
                                         as_supervised=True)

Middle

Set Up the Datasets

train_dataset, test_dataset = dataset["train"], dataset["test"]
tokenizer = info.features['text'].encoder

Now we're going to shuffle and pad the data. The BUFFER_SIZE argument sets the size of the buffer to sample from. In this case 10,000 entries from the training set are put in the buffer, and the "shuffle" is created by randomly selecting items from the buffer, replacing each item as it's selected, until all the data has been through the buffer. The padded_batch method creates batches of consecutive data and pads them so that they are all the same shape.

The BATCH_SIZE needs a little tuning. If it's too big, the memory needed might exceed what the GPU has (and the model might not generalize as well); if it's too small, training will take a long time. If you train and the GPU utilization stays at 0, try reducing the batch size.

Also note that if you change the batch size you have to go back to the previous step and re-define train_dataset and test_dataset, because the next step alters them and re-altering them breaks the shapes.

BUFFER_SIZE = 10000
# if the batch size is too big it will run out of memory on the GPU 
# so you might have to experiment with this
BATCH_SIZE = 32

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)

The Model

The previous model had one Bidirectional layer, this will add a second one.

Embedding

The Embedding layer converts our integer inputs into vectors of real numbers, which are a better input for a neural network.

Bidirectional

The Bidirectional layer is a wrapper for recurrent layers that runs the wrapped layer over the input both forwards and backwards and concatenates the two outputs.

LSTM

The LSTM layer implements Long Short-Term Memory. The first argument is the size of the outputs. The first LSTM passes return_sequences=True so that it emits its full output sequence for the second Bidirectional LSTM to consume, rather than just its final state. This is similar to the model that we ran previously on the same data, but it has an extra layer (so it uses more memory).

model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tensorflow.keras.layers.Bidirectional(
        tensorflow.keras.layers.LSTM(64, return_sequences=True)),
    tensorflow.keras.layers.Bidirectional(
        tensorflow.keras.layers.LSTM(32)),
    tensorflow.keras.layers.Dense(64, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 64)          523840    
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         66048     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                41216     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 635,329
Trainable params: 635,329
Non-trainable params: 0
_________________________________________________________________
None

Compile It

model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])

Train the Model

ONCE_PER_EPOCH = 2
NUM_EPOCHS = 10
with TIMER:
    history = model.fit(train_dataset,
                        epochs=NUM_EPOCHS,
                        validation_data=test_dataset,
                        verbose=ONCE_PER_EPOCH)
2019-09-21 17:26:50,395 graeae.timers.timer start: Started: 2019-09-21 17:26:50.394797
I0921 17:26:50.395130 140275698915136 timer.py:70] Started: 2019-09-21 17:26:50.394797
Epoch 1/10
W0921 17:26:51.400280 140275698915136 deprecation.py:323] From /home/hades/.virtualenvs/In-Too-Deep/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
782/782 - 224s - loss: 0.6486 - accuracy: 0.6039 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
782/782 - 214s - loss: 0.4941 - accuracy: 0.7661 - val_loss: 0.6706 - val_accuracy: 0.6744
Epoch 3/10
782/782 - 216s - loss: 0.4087 - accuracy: 0.8266 - val_loss: 0.4024 - val_accuracy: 0.8222
Epoch 4/10
782/782 - 217s - loss: 0.2855 - accuracy: 0.8865 - val_loss: 0.3343 - val_accuracy: 0.8645
Epoch 5/10
782/782 - 216s - loss: 0.2097 - accuracy: 0.9217 - val_loss: 0.2936 - val_accuracy: 0.8837
Epoch 6/10
782/782 - 217s - loss: 0.1526 - accuracy: 0.9467 - val_loss: 0.3188 - val_accuracy: 0.8771
Epoch 7/10
782/782 - 215s - loss: 0.1048 - accuracy: 0.9657 - val_loss: 0.3750 - val_accuracy: 0.8710
Epoch 8/10
782/782 - 216s - loss: 0.0764 - accuracy: 0.9757 - val_loss: 0.3821 - val_accuracy: 0.8762
Epoch 9/10
782/782 - 216s - loss: 0.0585 - accuracy: 0.9832 - val_loss: 0.4747 - val_accuracy: 0.8683
Epoch 10/10
782/782 - 216s - loss: 0.0438 - accuracy: 0.9883 - val_loss: 0.4441 - val_accuracy: 0.8704
2019-09-21 18:02:56,353 graeae.timers.timer end: Ended: 2019-09-21 18:02:56.353722
I0921 18:02:56.353781 140275698915136 timer.py:77] Ended: 2019-09-21 18:02:56.353722
2019-09-21 18:02:56,356 graeae.timers.timer end: Elapsed: 0:36:05.958925
I0921 18:02:56.356238 140275698915136 timer.py:78] Elapsed: 0:36:05.958925

Looking at the Performance

To get the history I had to pickle it and then copy it over to the machine with this org-notebook, so you can't just run this notebook and make it work unless everything is run on the same machine (which it wasn't).

path = Path("~/history.pkl").expanduser()
with path.open("wb") as writer:
    pickle.dump(history.history, writer)
path = Path("~/history.pkl").expanduser()
with path.open("rb") as reader:
    history = pickle.load(reader)
data = pandas.DataFrame(history)
best = data.val_loss.idxmin()
best_line = holoviews.VLine(best)
plot = (data.hvplot() * best_line).opts(
    title="Two-Layer LSTM Model",
    width=1000,
    height=800)
Embed(plot=plot, file_name="lstm_training")()

Figure Missing

It looks like the best epoch was the fifth one, with a validation loss of 0.29 and a validation accuracy of 0.88; after that it starts to overfit. It seems that text might be a harder problem than images.

IMDB Reviews Tensorflow Dataset

Beginning

We're going to use the IMDB Reviews Dataset (used in this tutorial) - a set of 50,000 movie reviews taken from the Internet Movie Database that have been classified as either positive or negative. It looks like the original source is a page on Stanford University's web site titled Large Movie Review Dataset. The dataset seems to be widely available (the Stanford page and Kaggle, for instance), but this will serve as practice for using tensorflow datasets as well.

Imports

Python

from functools import partial

PyPi

import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets

Graeae

from graeae import EmbedHoloviews, Timer

Set Up

Plotting

SLUG = "imdb-reviews-tensorflow-dataset"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")

Timer

TIMER = Timer()

Middle

Get the Dataset

Load It

The load function takes quite a few parameters; in this case we're just passing in three - the name of the dataset, with_info, which tells it to return both a Dataset and a DatasetInfo object, and as_supervised, which tells the builder to return the Dataset as a series of (input, label) tuples.

dataset, info = tensorflow_datasets.load('imdb_reviews/subwords8k',
                                         with_info=True,
                                         as_supervised=True)

Split It

The dataset is a dict with three keys:

print(dataset.keys())
dict_keys(['test', 'train', 'unsupervised'])

As you might guess, we don't use the unsupervised key.

train_dataset, test_dataset = dataset['train'], dataset['test']

The Tokenizer

One of the advantages of using the tensorflow dataset version of this is that it comes with a pre-built tokenizer inside the DatasetInfo object.

print(info.features)
FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>),
})

tokenizer = info.features['text'].encoder
print(tokenizer)
<SubwordTextEncoder vocab_size=8185>

The tokenizer is a SubwordTextEncoder with a vocabulary size of 8,185.
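To get a feel for what the encoder does, you can round-trip a string through it (the sample sentence here is just an illustration; the encoding is designed to be fully invertible):

sample = "The movie was not terrible."
encoded = tokenizer.encode(sample)
print(encoded)
print(tokenizer.decode(encoded))
assert tokenizer.decode(encoded) == sample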

Set Up Data

We're going to shuffle the training data and then add padding to both sets so they're all the same size.

BUFFER_SIZE = 20000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)

The Model

model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tensorflow.keras.layers.Bidirectional(tensorflow.keras.layers.LSTM(64)),
    tensorflow.keras.layers.Dense(64, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 64)          523840    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 598,209
Trainable params: 598,209
Non-trainable params: 0
_________________________________________________________________

Compile It

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Train It

EPOCHS = 10
SILENT = 0
ONCE_PER_EPOCH = 2
with TIMER:
    history = model.fit(train_dataset,
                        epochs=EPOCHS,
                        validation_data=test_dataset,
                        verbose=ONCE_PER_EPOCH)
2019-09-21 15:52:50,469 graeae.timers.timer start: Started: 2019-09-21 15:52:50.469787
I0921 15:52:50.469841 140086305412928 timer.py:70] Started: 2019-09-21 15:52:50.469787
Epoch 1/10
391/391 - 80s - loss: 0.3991 - accuracy: 0.8377 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
391/391 - 80s - loss: 0.3689 - accuracy: 0.8571 - val_loss: 0.4595 - val_accuracy: 0.8021
Epoch 3/10
391/391 - 80s - loss: 0.3664 - accuracy: 0.8444 - val_loss: 0.5262 - val_accuracy: 0.7228
Epoch 4/10
391/391 - 80s - loss: 0.5611 - accuracy: 0.7133 - val_loss: 0.6832 - val_accuracy: 0.6762
Epoch 5/10
391/391 - 80s - loss: 0.6151 - accuracy: 0.6597 - val_loss: 0.5164 - val_accuracy: 0.7844
Epoch 6/10
391/391 - 80s - loss: 0.3842 - accuracy: 0.8340 - val_loss: 0.4970 - val_accuracy: 0.7996
Epoch 7/10
391/391 - 80s - loss: 0.2449 - accuracy: 0.9058 - val_loss: 0.3639 - val_accuracy: 0.8463
Epoch 8/10
391/391 - 80s - loss: 0.1896 - accuracy: 0.9306 - val_loss: 0.3698 - val_accuracy: 0.8614
Epoch 9/10
391/391 - 80s - loss: 0.1555 - accuracy: 0.9456 - val_loss: 0.3896 - val_accuracy: 0.8535
Epoch 10/10
391/391 - 80s - loss: 0.1195 - accuracy: 0.9606 - val_loss: 0.4878 - val_accuracy: 0.8428
2019-09-21 16:06:09,935 graeae.timers.timer end: Ended: 2019-09-21 16:06:09.935707
I0921 16:06:09.935745 140086305412928 timer.py:77] Ended: 2019-09-21 16:06:09.935707
2019-09-21 16:06:09,938 graeae.timers.timer end: Elapsed: 0:13:19.465920
I0921 16:06:09.938812 140086305412928 timer.py:78] Elapsed: 0:13:19.465920

Plot the Performance

  • Note: This only works if your kernel is on the local machine; running it remotely gives an error because it tries to save the plot on the remote machine.
data = pandas.DataFrame(history.history)
data = data.rename(columns={"loss": "Training Loss",
                            "accuracy": "Training Accuracy",
                            "val_loss": "Validation Loss",
                            "val_accuracy": "Validation Accuracy"})
plot = data.hvplot().opts(title="LSTM IMDB Performance", width=1000, height=800)
Embed(plot=plot, file_name="model_performance")()

Figure Missing

It looks like I over-trained it, as the validation loss keeps climbing. (Also note that I used this notebook to troubleshoot, so there was actually one extra epoch that isn't shown.)

End

Citation

This is the paper where the dataset was originally used.

  • Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

BBC News Classification

Beginning

Imports

Python

from functools import partial
from pathlib import Path
import csv
import random

PyPi

from sklearn.model_selection import train_test_split
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import hvplot.pandas
import numpy
import pandas
import spacy
import tensorflow

Graeae

from graeae import EmbedHoloviews, SubPathLoader, Timer

Set Up

The Timer

TIMER = Timer()

The Environment

ENVIRONMENT = SubPathLoader('DATASETS')

Spacy

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_lg")

Plotting

SLUG = "bbc-news-classification"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")

Middle

Load the Datasets

path = Path(ENVIRONMENT["BBC_NEWS"]).expanduser()

texts = []
labels = []
with TIMER:
    with path.open() as csvfile:
        lines = csv.DictReader(csvfile)
        for line in lines:
            labels.append(line["category"])
            texts.append(nlp(line["text"]))
WARNING: Logging before flag parsing goes to stderr.
I0908 13:32:14.804769 139839933974336 environment.py:35] Environment Path: /home/athena/.env
I0908 13:32:14.806000 139839933974336 environment.py:90] Environment Path: /home/athena/.config/datasets/env
2019-09-08 13:32:14,806 graeae.timers.timer start: Started: 2019-09-08 13:32:14.806861
I0908 13:32:14.806965 139839933974336 timer.py:70] Started: 2019-09-08 13:32:14.806861
2019-09-08 13:33:37,430 graeae.timers.timer end: Ended: 2019-09-08 13:33:37.430228
I0908 13:33:37.430259 139839933974336 timer.py:77] Ended: 2019-09-08 13:33:37.430228
2019-09-08 13:33:37,431 graeae.timers.timer end: Elapsed: 0:01:22.623367
I0908 13:33:37.431128 139839933974336 timer.py:78] Elapsed: 0:01:22.623367

print(texts[random.randrange(len(texts))])
candidate resigns over bnp link a prospective candidate for the uk independence party (ukip) has resigned after admitting a  brief attachment  to the british national party(bnp).  nicholas betts-green  who had been selected to fight the suffolk coastal seat  quit after reports in a newspaper that he attended a bnp meeting. the former teacher confirmed he had attended the meeting but said that was the only contact he had with the group. mr betts-green resigned after being questioned by the party s leadership. a ukip spokesman said mr betts-green s resignation followed disclosures in the east anglian daily times last month about his attendance at a bnp meeting.  he did once attend a bnp meeting. he did not like what he saw and heard and will take no further part of it   the spokesman added. a meeting of suffolk coastal ukip members is due to be held next week to discuss a replacement. mr betts-green  of woodbridge  suffolk  has also resigned as ukip s branch chairman.

So, it looks like the text has been lower-cased but there's still punctuation and extra white-space.

print(f"Rows: {len(labels):,}")
print(f"Unique Labels: {len(set(labels)):,}")
Rows: 2,225
Unique Labels: 5

Since there are only five categories, maybe we should plot them.

labels_frame = pandas.DataFrame({"label": labels})
counts = labels_frame.label.value_counts().reset_index().rename(
    columns={"index": "Category", "label": "Articles"})
plot = counts.hvplot.bar("Category", "Articles").opts(
    title="Count of BBC News Articles by Category",
    height=800, width=1000)
Embed(plot=plot, file_name="bbc_category_counts")()

Figure Missing

It looks like the categories are somewhat unevenly distributed. Now to normalize the tokens.

with TIMER:
    cleaned = [[token.lemma_ for token in text if not any((token.is_stop, token.is_space, token.is_punct))]
               for text in texts]
2019-09-08 13:33:40,257 graeae.timers.timer start: Started: 2019-09-08 13:33:40.257908
I0908 13:33:40.257930 139839933974336 timer.py:70] Started: 2019-09-08 13:33:40.257908
2019-09-08 13:33:40,810 graeae.timers.timer end: Ended: 2019-09-08 13:33:40.810135
I0908 13:33:40.810176 139839933974336 timer.py:77] Ended: 2019-09-08 13:33:40.810135
2019-09-08 13:33:40,811 graeae.timers.timer end: Elapsed: 0:00:00.552227
I0908 13:33:40.811067 139839933974336 timer.py:78] Elapsed: 0:00:00.552227

The Tokenizers

Even though I've already tokenized the texts with spaCy, we eventually need to convert them to integer sequences, so I'll use the tensorflow keras Tokenizer.

Note: The labels tokenizer doesn't get the out-of-vocabulary token, only the text-tokenizer does.

tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
labels_tokenizer = Tokenizer()
labels_tokenizer.fit_on_texts(labels)

The num_words is the total number of words that will be kept when converting texts to sequences - I don't know why a thousand, I just found that in the "answer" notebook. The oov_token is what's used when a word is encountered that's outside of the words we're keeping in our vocabulary (Out Of Vocabulary). The next step is to create the word-index by fitting the tokenizer to the text.

with TIMER:
    tokenizer.fit_on_texts(cleaned)
2019-09-08 14:59:30,671 graeae.timers.timer start: Started: 2019-09-08 14:59:30.671536
I0908 14:59:30.671563 139839933974336 timer.py:70] Started: 2019-09-08 14:59:30.671536
2019-09-08 14:59:30,862 graeae.timers.timer end: Ended: 2019-09-08 14:59:30.862483
I0908 14:59:30.862523 139839933974336 timer.py:77] Ended: 2019-09-08 14:59:30.862483
2019-09-08 14:59:30,863 graeae.timers.timer end: Elapsed: 0:00:00.190947
I0908 14:59:30.863504 139839933974336 timer.py:78] Elapsed: 0:00:00.190947

The tokenizer now has a dictionary named word_index that holds the word:index pairs for all the tokens found (it only applies num_words when you call the tokenizer's conversion methods, according to Stack Overflow).

print(f"{len(tokenizer.word_index):,}")
24,339
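Since the tokenizer was built with num_words=1000, texts_to_sequences only keeps tokens whose index is below 1,000 and maps everything else to the <OOV> index. A quick sanity check of that behavior (a sketch, using the first cleaned document):

example = tokenizer.texts_to_sequences([cleaned[0]])[0]
oov_index = tokenizer.word_index[tokenizer.oov_token]
# every surviving index is either the OOV token or one of the most common words
assert all(index == oov_index or index < 1000 for index in example)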

Making the Sequences

I've trained the Tokenizer so that it has a word-index, but now we have to convert our texts to integer sequences and pad them so they're all the same length.

MAX_LENGTH = 120
sequences = tokenizer.texts_to_sequences(cleaned)
padded = pad_sequences(sequences, padding="post", maxlen=MAX_LENGTH)
labels_sequenced = labels_tokenizer.texts_to_sequences(labels)

Make training and testing sets

TESTING = 0.2
x_train, x_test, y_train, y_test = train_test_split(
    padded, labels_sequenced,
    test_size=TESTING)
x_train, x_validation, y_train, y_validation = train_test_split(
    x_train, y_train, test_size=TESTING)

y_train = numpy.array(y_train)
y_test = numpy.array(y_test)
y_validation = numpy.array(y_validation)

print(f"Training: {x_train.shape}")
print(f"Validation: {x_validation.shape}")
print(f"Testing: {x_test.shape}")
Training: (1424, 120)
Validation: (356, 120)
Testing: (445, 120)

Note: I originally forgot to pass the TESTING variable with the test_size keyword and got an error about not being able to use a singleton array - don't forget the keyword arguments when you pass anything other than the data to train_test_split.

The Model

vocabulary_size = 1000
embedding_dimension = 16
max_length=120

model = tensorflow.keras.Sequential([
    layers.Embedding(vocabulary_size, embedding_dimension,
                     input_length=max_length),
    layers.GlobalAveragePooling1D(),
    layers.Dense(24, activation="relu"),
    layers.Dense(6, activation="softmax"),
])
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 120, 16)           16000     
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 24)                408       
_________________________________________________________________
dense_3 (Dense)              (None, 6)                 150       
=================================================================
Total params: 16,558
Trainable params: 16,558
Non-trainable params: 0
_________________________________________________________________
None
model.fit(x_train, y_train, epochs=30,
          validation_data=(x_validation, y_validation), verbose=2)
Train on 1424 samples, validate on 356 samples
Epoch 1/30
1424/1424 - 0s - loss: 1.7623 - accuracy: 0.2879 - val_loss: 1.7257 - val_accuracy: 0.5000
Epoch 2/30
1424/1424 - 0s - loss: 1.6871 - accuracy: 0.5190 - val_loss: 1.6332 - val_accuracy: 0.5281
Epoch 3/30
1424/1424 - 0s - loss: 1.5814 - accuracy: 0.4782 - val_loss: 1.5118 - val_accuracy: 0.4944
Epoch 4/30
1424/1424 - 0s - loss: 1.4417 - accuracy: 0.4677 - val_loss: 1.3543 - val_accuracy: 0.5365
Epoch 5/30
1424/1424 - 0s - loss: 1.2706 - accuracy: 0.5934 - val_loss: 1.1850 - val_accuracy: 0.7022
Epoch 6/30
1424/1424 - 0s - loss: 1.1075 - accuracy: 0.6749 - val_loss: 1.0387 - val_accuracy: 0.8006
Epoch 7/30
1424/1424 - 0s - loss: 0.9606 - accuracy: 0.8483 - val_loss: 0.9081 - val_accuracy: 0.8567
Epoch 8/30
1424/1424 - 0s - loss: 0.8244 - accuracy: 0.8869 - val_loss: 0.7893 - val_accuracy: 0.8848
Epoch 9/30
1424/1424 - 0s - loss: 0.6963 - accuracy: 0.9164 - val_loss: 0.6747 - val_accuracy: 0.8961
Epoch 10/30
1424/1424 - 0s - loss: 0.5815 - accuracy: 0.9228 - val_loss: 0.5767 - val_accuracy: 0.9185
Epoch 11/30
1424/1424 - 0s - loss: 0.4831 - accuracy: 0.9375 - val_loss: 0.4890 - val_accuracy: 0.9270
Epoch 12/30
1424/1424 - 0s - loss: 0.3991 - accuracy: 0.9473 - val_loss: 0.4195 - val_accuracy: 0.9326
Epoch 13/30
1424/1424 - 0s - loss: 0.3321 - accuracy: 0.9508 - val_loss: 0.3669 - val_accuracy: 0.9438
Epoch 14/30
1424/1424 - 0s - loss: 0.2800 - accuracy: 0.9572 - val_loss: 0.3268 - val_accuracy: 0.9494
Epoch 15/30
1424/1424 - 0s - loss: 0.2385 - accuracy: 0.9656 - val_loss: 0.2936 - val_accuracy: 0.9438
Epoch 16/30
1424/1424 - 0s - loss: 0.2053 - accuracy: 0.9740 - val_loss: 0.2693 - val_accuracy: 0.9466
Epoch 17/30
1424/1424 - 0s - loss: 0.1775 - accuracy: 0.9761 - val_loss: 0.2501 - val_accuracy: 0.9466
Epoch 18/30
1424/1424 - 0s - loss: 0.1557 - accuracy: 0.9789 - val_loss: 0.2332 - val_accuracy: 0.9494
Epoch 19/30
1424/1424 - 0s - loss: 0.1362 - accuracy: 0.9831 - val_loss: 0.2189 - val_accuracy: 0.9522
Epoch 20/30
1424/1424 - 0s - loss: 0.1209 - accuracy: 0.9853 - val_loss: 0.2082 - val_accuracy: 0.9551
Epoch 21/30
1424/1424 - 0s - loss: 0.1070 - accuracy: 0.9860 - val_loss: 0.1979 - val_accuracy: 0.9579
Epoch 22/30
1424/1424 - 0s - loss: 0.0952 - accuracy: 0.9888 - val_loss: 0.1897 - val_accuracy: 0.9551
Epoch 23/30
1424/1424 - 0s - loss: 0.0854 - accuracy: 0.9902 - val_loss: 0.1815 - val_accuracy: 0.9579
Epoch 24/30
1424/1424 - 0s - loss: 0.0765 - accuracy: 0.9916 - val_loss: 0.1761 - val_accuracy: 0.9522
Epoch 25/30
1424/1424 - 0s - loss: 0.0689 - accuracy: 0.9930 - val_loss: 0.1729 - val_accuracy: 0.9579
Epoch 26/30
1424/1424 - 0s - loss: 0.0618 - accuracy: 0.9951 - val_loss: 0.1680 - val_accuracy: 0.9551
Epoch 27/30
1424/1424 - 0s - loss: 0.0559 - accuracy: 0.9958 - val_loss: 0.1633 - val_accuracy: 0.9551
Epoch 28/30
1424/1424 - 0s - loss: 0.0505 - accuracy: 0.9958 - val_loss: 0.1594 - val_accuracy: 0.9579
Epoch 29/30
1424/1424 - 0s - loss: 0.0457 - accuracy: 0.9965 - val_loss: 0.1559 - val_accuracy: 0.9522
Epoch 30/30
1424/1424 - 0s - loss: 0.0416 - accuracy: 0.9972 - val_loss: 0.1544 - val_accuracy: 0.9551

It seems to get good surprisingly fast - it might be overfitting toward the end.

loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Loss: {loss: .2f} Accuracy: {accuracy:.2f}")
Loss:  0.16 Accuracy: 0.95

It does pretty well, even on the test set.

Plotting the Performance

data = pandas.DataFrame(model.history.history)
plot = data.hvplot().opts(title="Training Performance", width=1000, height=800)
Embed(plot=plot, file_name="model_performance")()

Figure Missing

Unlike with the image classifications, the validation performance never quite matches the training performance (although it's quite good), probably because we aren't doing any kind of augmentation the way you tend to do with images.

End

Okay, so we seem to have a decent model, but is that really the end-game? No, we want to be able to predict what classification a new input should get.

index_to_label = {value:key for (key, value) in labels_tokenizer.word_index.items()}

def category(text: str) -> None:
    """Categorizes the text

    Args:
     text: text to categorize
    """
    text = tokenizer.texts_to_sequences([text])
    predictions = model.predict(pad_sequences(text, maxlen=MAX_LENGTH))
    print(f"Predicted Category: {index_to_label[predictions.argmax()]}")
    return
text = "crickets are nutritious and delicious but make for such a silly game"
category(text)
Predicted Category: sport

text = "i like butts that are big and round, something something like a xxx throw down, and so does the house of parliament"
category(text)
Predicted Category: sport

It kind of looks like it's biased toward sports.

text = "tv future hand viewer home theatre"
category(text)
Predicted Category: sport

Something isn't right here.
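One likely culprit is that the training texts were lemmatized and stop-word filtered with spaCy before being fed to the tokenizer, while category passes the raw string straight through, so most of the words probably map to the <OOV> token (and the padding side doesn't match the post-padding used for training either). A minimal sketch of a version that applies the same cleaning first (category_cleaned is a hypothetical name, not part of the original exercise):

def category_cleaned(text: str) -> None:
    """Categorizes the text after the same spaCy cleaning used for training

    Args:
     text: text to categorize
    """
    # lemmatize and drop stop words, whitespace, and punctuation,
    # the same way the training texts were cleaned
    tokens = [token.lemma_ for token in nlp(text)
              if not any((token.is_stop, token.is_space, token.is_punct))]
    sequence = tokenizer.texts_to_sequences([tokens])
    # pad at the end (post) to match the training sequences
    padded = pad_sequences(sequence, padding="post", maxlen=MAX_LENGTH)
    predictions = model.predict(padded)
    # class 0 is never produced by the labels tokenizer, so fall back if it wins
    print(f"Predicted Category: {index_to_label.get(predictions.argmax(), '<unknown>')}")
    return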

Cleaning the BBC News Archive

Beginning

This is an initial look at cleaning up a text dataset from the BBC News archives. Although the exercise cites this as the source, the dataset provided doesn't look like the actual raw dataset, which is broken up into folders that classify the contents, with each news item in a separate file. Instead we're starting with a partially pre-processed CSV that has been lower-cased, with the classification given as the first column of the dataset.

Imports

Python

from pathlib import Path

PyPi

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas

Graeae

from graeae import SubPathLoader, Timer

Set Up

The Environment

ENVIRONMENT = SubPathLoader("DATASETS")

The Timer

TIMER = Timer()

Middle

The DataSet

bbc_path = Path(ENVIRONMENT["BBC_NEWS"]).expanduser()
with TIMER:
    data = pandas.read_csv(bbc_path/"bbc-text.csv")
2019-08-25 18:51:38,411 graeae.timers.timer start: Started: 2019-08-25 18:51:38.411196
2019-08-25 18:51:38,658 graeae.timers.timer end: Ended: 2019-08-25 18:51:38.658181
2019-08-25 18:51:38,658 graeae.timers.timer end: Elapsed: 0:00:00.246985

print(data.shape)
(2225, 2)

print(data.sample().iloc[0])
category                                                sport
text        bell set for england debut bath prop duncan be...
Name: 2134, dtype: object

So we have two columns - category and text, text being the one we have to clean up.

print(data.text.dtype)
object

That's not such an informative answer, but I checked and each row of text is a single string.

The Tokenizer

The Keras Tokenizer tokenizes the text for us as well as removing the punctuation, lower-casing the text, and some other things. We're also going to use an Out-of-Vocabulary token of "<OOV>" to identify words that are outside of the vocabulary when converting new texts to sequences.

tokenizer = Tokenizer(oov_token="<OOV>", num_words=100)
tokenizer.fit_on_texts(data.text)
word_index = tokenizer.word_index
print(len(word_index))
29727

The word-index is a dict that maps each word found in the documents to an integer index, with the most frequent words getting the lowest indices (the counts themselves live in tokenizer.word_counts).
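A quick way to peek at the mapping (a sketch; the exact words depend on the corpus) is to sort it by index:

# index 1 is the <OOV> token; the most frequent words get the lowest indices
print(sorted(word_index.items(), key=lambda item: item[1])[:5])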

Convert the Texts To Sequences

We're going to convert each of our texts to a sequence of numbers representing the words in them (integer encoding, not one-hot). The pad_sequences function adds zeros to the end of sequences that are shorter than the longest one so that they are all the same size.

sequences = tokenizer.texts_to_sequences(data.text)
padded = pad_sequences(sequences, padding="post")
print(padded[0])
print(padded.shape)
[1 1 7 ... 0 0 0]
(2225, 4491)

Strangely, the Keras Tokenizer doesn't appear to have a good way to handle stopwords. Maybe sklearn is more appropriate here.

vectorizer = CountVectorizer(stop_words=stopwords.words("english"),
                             lowercase=True, min_df=3,
                             max_df=0.9, max_features=5000)
vectors = vectorizer.fit_transform(data.text)
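If you'd rather stick with the Keras Tokenizer, one workaround (a sketch; filtered and keras_tokenizer are just illustrative names, and it assumes the NLTK stopwords corpus has been downloaded) is to strip the stop words yourself before fitting:

STOPWORDS = set(stopwords.words("english"))

# drop the stop words before handing the text to the Keras Tokenizer
filtered = data.text.apply(
    lambda text: " ".join(word for word in text.split()
                          if word not in STOPWORDS))
keras_tokenizer = Tokenizer(oov_token="<OOV>", num_words=100)
keras_tokenizer.fit_on_texts(filtered)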

End

Sources

The Original Dataset

  • D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006. [PDF] [BibTeX].

Sign Language Exercise

Beginning

The data I'm using is the Sign-Language MNIST set (hosted on Kaggle). It's a drop-in replacement for the MNIST dataset that contains images of hands showing letters in American Sign Language. It was created by taking 1,704 photos of hands showing letters of the alphabet and then using ImageMagick to alter the photos, creating a training set with 27,455 images and a test set with 7,172 images.

Imports

Python

from functools import partial
from pathlib import Path

PyPi

from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing import image as keras_image
from tensorflow.keras.utils import to_categorical
import hvplot.pandas
import matplotlib.pyplot as pyplot
import matplotlib.image as matplotlib_image

import numpy
import pandas
import seaborn
import tensorflow

Graeae

from graeae import EmbedHoloviews, SubPathLoader, Timer

Set Up

Plotting

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
seaborn.set(style="whitegrid",
            rc={"axes.grid": False,
                "font.family": ["sans-serif"],
                "font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
                "figure.figsize": (8, 6)},
            font_scale=1)
FIGURE_SIZE = (12, 10)

Embed = partial(
    EmbedHoloviews,
    folder_path="../../files/posts/keras/sign-language-exercise/")

Timer

TIMER = Timer()

The Environment

ENVIRONMENT = SubPathLoader("DATASETS")

Middle

The Datasets

root_path = Path(ENVIRONMENT["SIGN_LANGUAGE_MNIST"]).expanduser()
def get_data(test_or_train: str) -> tuple:
    """Gets the MNIST data

    The pixels are reshaped so that they are 28x28
    Also, an extra dimension is added to make the shape:
     (<rows>, 28, 28, 1)

    Also converts the labels to a categorical (so there are 25 columns)

    Args:
     test_or_train: which data set to load

    Returns: 
     images, labels: numpy arrays with the data
    """
    path = root_path/f"sign_mnist_{test_or_train}.csv"
    data = pandas.read_csv(path) 
    labels = data.label
    labels = to_categorical(labels)
    pixels = data[[column for column in data.columns if column.startswith("pixel")]]
    pixels = pixels.values.reshape(((len(pixels), 28, 28, 1)))
    print(labels.shape)
    print(pixels.shape)
    return pixels, labels
  • Training

    The data is a CSV with the first column being the labels and the rest of the columns holding the pixel values. To make it work with our networks we need to re-shape the data so that we have a shape of (<rows>, 28, 28). The 28 comes from the fact that there are 784 pixel columns (28 x 28 = 784).

    train_images, train_labels = get_data("train")
    assert train_images.shape == (27455, 28, 28, 1)
    assert train_labels.shape == (27455, 25)
    
    (27455, 25)
    (27455, 28, 28, 1)
    
    

    As you can see, there's a lot of columns in the original set. The first one is the "label" and the rest are the "pixel" columns.

    test_images, test_labels = get_data("test")
    assert test_images.shape == (7172, 28, 28, 1)
    assert test_labels.shape == (7172, 25)
    
    (7172, 25)
    (7172, 28, 28, 1)
    
    

    Note: The original exercise calls for doing this with the python csv module. But why?
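    For reference, a version using the csv module might look something like this (a sketch, not what was run above; get_data_csv is just an illustrative name):

    import csv

    def get_data_csv(test_or_train: str) -> tuple:
        """Loads the sign-language MNIST CSV with the csv module

        Args:
         test_or_train: which data set to load

        Returns:
         images, labels: numpy arrays with the data
        """
        path = root_path/f"sign_mnist_{test_or_train}.csv"
        labels, pixels = [], []
        with path.open() as csvfile:
            reader = csv.reader(csvfile)
            next(reader)  # skip the header row
            for row in reader:
                labels.append(int(row[0]))
                pixels.append([int(cell) for cell in row[1:]])
        pixels = numpy.array(pixels).reshape((len(pixels), 28, 28, 1))
        labels = to_categorical(numpy.array(labels))
        return pixels, labels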

Data Generators

training_data_generator = ImageDataGenerator(
    rescale = 1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

validation_data_generator = ImageDataGenerator(rescale = 1./255)

train_generator = training_data_generator.flow(
        train_images, train_labels,
)

validation_generator = validation_data_generator.flow(
        test_images, test_labels,
)

The Model

Part of the exercise requires that we only use two convolutional layers.

model = tensorflow.keras.models.Sequential([
    # Input Layer/convolution
    tensorflow.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
    tensorflow.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tensorflow.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tensorflow.keras.layers.MaxPooling2D(2,2),
    # Flatten
    tensorflow.keras.layers.Flatten(),
    tensorflow.keras.layers.Dropout(0.5),
    # Fully-connected and output layers
    tensorflow.keras.layers.Dense(512, activation='relu'),
    tensorflow.keras.layers.Dense(25, activation='softmax'),
])
model.summary()
Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_12 (Conv2D)           (None, 26, 26, 64)        640       
_________________________________________________________________
max_pooling2d_12 (MaxPooling (None, 13, 13, 64)        0         
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 11, 11, 128)       73856     
_________________________________________________________________
max_pooling2d_13 (MaxPooling (None, 5, 5, 128)         0         
_________________________________________________________________
flatten_6 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 3200)              0         
_________________________________________________________________
dense_12 (Dense)             (None, 512)               1638912   
_________________________________________________________________
dense_13 (Dense)             (None, 25)                12825     
=================================================================
Total params: 1,726,233
Trainable params: 1,726,233
Non-trainable params: 0
_________________________________________________________________

Train It

model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
MODELS = Path("~/models/sign-language-mnist/").expanduser()
assert MODELS.is_dir()
best_model = MODELS/"two-cnn-layers.hdf5"
checkpoint = tensorflow.keras.callbacks.ModelCheckpoint(
    str(best_model), monitor="val_accuracy", verbose=1, 
    save_best_only=True)

with TIMER:
    model.fit_generator(generator=train_generator,
                        epochs=25,
                        callbacks=[checkpoint],
                        validation_data = validation_generator,
                        verbose=2)
2019-08-25 16:25:13,710 graeae.timers.timer start: Started: 2019-08-25 16:25:13.710604
I0825 16:25:13.710640 140637170140992 timer.py:70] Started: 2019-08-25 16:25:13.710604
Epoch 1/25

Epoch 00001: val_accuracy improved from -inf to 0.45427, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 8s - loss: 2.6016 - accuracy: 0.2048 - val_loss: 1.5503 - val_accuracy: 0.4543
Epoch 2/25

Epoch 00002: val_accuracy improved from 0.45427 to 0.71403, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 1.8267 - accuracy: 0.4160 - val_loss: 0.8762 - val_accuracy: 0.7140
Epoch 3/25

Epoch 00003: val_accuracy improved from 0.71403 to 0.74888, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 1.4297 - accuracy: 0.5323 - val_loss: 0.7413 - val_accuracy: 0.7489
Epoch 4/25

Epoch 00004: val_accuracy improved from 0.74888 to 0.76157, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 1.1984 - accuracy: 0.6100 - val_loss: 0.6402 - val_accuracy: 0.7616
Epoch 5/25

Epoch 00005: val_accuracy improved from 0.76157 to 0.84816, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 1.0498 - accuracy: 0.6570 - val_loss: 0.4581 - val_accuracy: 0.8482
Epoch 6/25

Epoch 00006: val_accuracy improved from 0.84816 to 0.85778, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 0.9340 - accuracy: 0.6944 - val_loss: 0.4195 - val_accuracy: 0.8578
Epoch 7/25

Epoch 00007: val_accuracy improved from 0.85778 to 0.90240, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 0.8522 - accuracy: 0.7189 - val_loss: 0.3270 - val_accuracy: 0.9024
Epoch 8/25

Epoch 00008: val_accuracy did not improve from 0.90240
858/858 - 7s - loss: 0.7963 - accuracy: 0.7410 - val_loss: 0.3144 - val_accuracy: 0.8887
Epoch 9/25

Epoch 00009: val_accuracy did not improve from 0.90240
858/858 - 7s - loss: 0.7388 - accuracy: 0.7560 - val_loss: 0.3184 - val_accuracy: 0.8984
Epoch 10/25

Epoch 00010: val_accuracy improved from 0.90240 to 0.92777, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 0.7127 - accuracy: 0.7692 - val_loss: 0.2045 - val_accuracy: 0.9278
Epoch 11/25

Epoch 00011: val_accuracy improved from 0.92777 to 0.93572, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 9s - loss: 0.6798 - accuracy: 0.7792 - val_loss: 0.1813 - val_accuracy: 0.9357
Epoch 12/25

Epoch 00012: val_accuracy improved from 0.93572 to 0.94046, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 0.6506 - accuracy: 0.7875 - val_loss: 0.1857 - val_accuracy: 0.9405
Epoch 13/25

Epoch 00013: val_accuracy improved from 0.94046 to 0.94074, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 0.6365 - accuracy: 0.7941 - val_loss: 0.1691 - val_accuracy: 0.9407
Epoch 14/25

Epoch 00014: val_accuracy improved from 0.94074 to 0.95706, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 0.6127 - accuracy: 0.8028 - val_loss: 0.1426 - val_accuracy: 0.9571
Epoch 15/25

Epoch 00015: val_accuracy did not improve from 0.95706
858/858 - 7s - loss: 0.6009 - accuracy: 0.8076 - val_loss: 0.1925 - val_accuracy: 0.9265
Epoch 16/25

Epoch 00016: val_accuracy improved from 0.95706 to 0.96207, saving model to /home/athena/models/sign-language-mnist/two-cnn-layers.hdf5
858/858 - 7s - loss: 0.5883 - accuracy: 0.8121 - val_loss: 0.1393 - val_accuracy: 0.9621
Epoch 17/25

Epoch 00017: val_accuracy did not improve from 0.96207
858/858 - 7s - loss: 0.5785 - accuracy: 0.8127 - val_loss: 0.2188 - val_accuracy: 0.9250
Epoch 18/25

Epoch 00018: val_accuracy did not improve from 0.96207
858/858 - 7s - loss: 0.5728 - accuracy: 0.8158 - val_loss: 0.2003 - val_accuracy: 0.9350
Epoch 19/25

Epoch 00019: val_accuracy did not improve from 0.96207
858/858 - 7s - loss: 0.5633 - accuracy: 0.8225 - val_loss: 0.1452 - val_accuracy: 0.9578
Epoch 20/25

Epoch 00020: val_accuracy did not improve from 0.96207
858/858 - 7s - loss: 0.5536 - accuracy: 0.8223 - val_loss: 0.1341 - val_accuracy: 0.9605
Epoch 21/25

Epoch 00021: val_accuracy did not improve from 0.96207
858/858 - 8s - loss: 0.5477 - accuracy: 0.8252 - val_loss: 0.1500 - val_accuracy: 0.9442
Epoch 22/25

Epoch 00022: val_accuracy did not improve from 0.96207
858/858 - 7s - loss: 0.5367 - accuracy: 0.8291 - val_loss: 0.1435 - val_accuracy: 0.9568
Epoch 23/25

Epoch 00023: val_accuracy did not improve from 0.96207
858/858 - 7s - loss: 0.5425 - accuracy: 0.8336 - val_loss: 0.1598 - val_accuracy: 0.9615
Epoch 24/25

Epoch 00024: val_accuracy did not improve from 0.96207
858/858 - 8s - loss: 0.5243 - accuracy: 0.8330 - val_loss: 0.1749 - val_accuracy: 0.9483
Epoch 25/25

Epoch 00025: val_accuracy did not improve from 0.96207
858/858 - 7s - loss: 0.5163 - accuracy: 0.8379 - val_loss: 0.1353 - val_accuracy: 0.9587
2019-08-25 16:28:20,707 graeae.timers.timer end: Ended: 2019-08-25 16:28:20.707567
I0825 16:28:20.707660 140637170140992 timer.py:77] Ended: 2019-08-25 16:28:20.707567
2019-08-25 16:28:20,712 graeae.timers.timer end: Elapsed: 0:03:06.996963
I0825 16:28:20.712478 140637170140992 timer.py:78] Elapsed: 0:03:06.996963
predictor = load_model(best_model)
data = pandas.DataFrame(model.history.history)
plot = data.hvplot().opts(title="Sign Language MNIST Training and Validation",
                          fontsize={"title": 16},
                          width=1000, height=800)
Embed(plot=plot, file_name="training")()

Figure Missing

I'm not sure why these small networks do so well, but this one seems to be performing fairly well.

loss, accuracy = predictor.evaluate(test_images, test_labels, verbose=0)
print(f"Loss: {loss:.2f}, Accuracy: {accuracy:.2f}")
Loss: 4.36, Accuracy: 0.72

So, actually, the performance drops quite a bit outside of the training, even though I'm using the same data-set.
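One likely explanation is that the validation generator rescaled the pixels by 1/255 while evaluate is being handed the raw 0-255 test_images, so the inputs are on a scale the network never saw during training. A quick check (a sketch; not re-run here) would be to rescale before evaluating:

# rescale the pixels the same way the data generators did
loss, accuracy = predictor.evaluate(test_images / 255.0, test_labels, verbose=0)
print(f"Loss: {loss:.2f}, Accuracy: {accuracy:.2f}")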

End

Source

Rock-Paper-Scissors

Beginning

Imports

Python

from functools import partial
from pathlib import Path
import random

PyPi

from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing import image as keras_image
import holoviews
import hvplot.pandas
import matplotlib.pyplot as pyplot
import matplotlib.image as matplotlib_image
import numpy
import pandas
import seaborn
import tensorflow

Graeae

from graeae import EmbedHoloviews, SubPathLoader, Timer, ZipDownloader

Set Up

Plotting

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
seaborn.set(style="whitegrid",
            rc={"axes.grid": False,
                "font.family": ["sans-serif"],
                "font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
                "figure.figsize": (8, 6)},
            font_scale=1)
FIGURE_SIZE = (12, 10)

Embed = partial(EmbedHoloviews,
                folder_path="../../files/posts/keras/rock-paper-scissors/")
holoviews.extension("bokeh")

The Timer

TIMER = Timer()

The Environment

ENVIRONMENT = SubPathLoader("DATASETS")

Middle

The Data

Downloading it

TRAINING_URL = "https://storage.googleapis.com/laurencemoroney-blog.appspot.com/rps.zip"
TEST_URL = "https://storage.googleapis.com/laurencemoroney-blog.appspot.com/rps-test-set.zip"
OUT_PATH = Path(ENVIRONMENT["ROCK_PAPER_SCISSORS"]).expanduser()
download_train = ZipDownloader(TRAINING_URL, OUT_PATH/"train")
download_test = ZipDownloader(TEST_URL, OUT_PATH/"test")
download_train()
download_test()
I0825 15:06:11.577721 139626236733248 environment.py:35] Environment Path: /home/athena/.env
I0825 15:06:11.578975 139626236733248 environment.py:90] Environment Path: /home/athena/.config/datasets/env
Files exist, not downloading
Files exist, not downloading

The data structure for the folders is fairly deep, so I'll make some shortcuts.

TRAINING = OUT_PATH/"train/rps"
rocks = TRAINING/"rock"
papers = TRAINING/"paper"
scissors = TRAINING/"scissors"
assert papers.is_dir()
assert rocks.is_dir()
assert scissors.is_dir()
rock_images = list(rocks.iterdir())
paper_images = list(papers.iterdir())
scissors_images = list(scissors.iterdir())
print(f"Rocks: {len(rock_images):,}")
print(f"Papers: {len(paper_images):,}")
print(f"Scissors: {len(scissors_images):,}")
Rocks: 840
Papers: 840
Scissors: 840

Some Examples

rock_sample = random.choice(rock_images)

image = matplotlib_image.imread(str(rock_sample))
pyplot.title("Rock")
pyplot.imshow(image)
pyplot.axis('Off')
pyplot.show()

rock.png

paper_sample = random.choice(paper_images)
image = matplotlib_image.imread(str(paper_sample))
pyplot.title("Paper")
pyplot.imshow(image)
pyplot.axis('Off')
pyplot.show()

paper.png

scissors_sample = random.choice(scissors_images)
image = matplotlib_image.imread(str(scissors_sample))
pyplot.title("Scissors")
pyplot.imshow(image)
pyplot.axis('Off')
pyplot.show()

scissors.png

Data Generators

Note: I was originally using keras_preprocessing.image.ImageDataGenerator and getting

AttributeError: 'DirectoryIterator' object has no attribute 'shape'

Make sure to use tensorflow.keras.preprocessing.image.ImageDataGenerator instead.

VALIDATION = OUT_PATH/"test/rps-test-set"
training_data_generator = ImageDataGenerator(
    rescale = 1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

validation_data_generator = ImageDataGenerator(rescale = 1./255)

train_generator = training_data_generator.flow_from_directory(
        TRAINING,
        target_size=(150,150),
        class_mode='categorical'
)

validation_generator = validation_data_generator.flow_from_directory(
        VALIDATION,
        target_size=(150,150),
        class_mode='categorical'
)
Found 2520 images belonging to 3 classes.
Found 372 images belonging to 3 classes.

A Four-CNN Model

Definition

This is a hand-crafted, relatively shallow Convolutional Neural Network. The input shape matches our target_size arguments for the data-generators. There are four convolutional layers with a filter size of 3 x 3, each followed by a max-pooling layer. The first two convolutional layers have 64 filters while the two following them have 128. The convolutional layers are followed by a layer to flatten the input and a dropout layer before reaching our fully-connected and output layers, the last of which uses softmax to predict the most likely category. Since we have three categories (rock, paper, or scissors) the final layer has three nodes.

model = tensorflow.keras.models.Sequential([
    # Input Layer/convolution
    tensorflow.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(150, 150, 3)),
    tensorflow.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tensorflow.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tensorflow.keras.layers.MaxPooling2D(2,2),
    # The third convolution
    tensorflow.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tensorflow.keras.layers.MaxPooling2D(2,2),
    # The fourth convolution
    tensorflow.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tensorflow.keras.layers.MaxPooling2D(2,2),
    # Flatten
    tensorflow.keras.layers.Flatten(),
    tensorflow.keras.layers.Dropout(0.5),
    # Fully-connected and output layers
    tensorflow.keras.layers.Dense(512, activation='relu'),
    tensorflow.keras.layers.Dense(3, activation='softmax')
])

Here's a summary of the layers.

model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_16 (Conv2D)           (None, 148, 148, 64)      1792      
_________________________________________________________________
max_pooling2d_16 (MaxPooling (None, 74, 74, 64)        0         
_________________________________________________________________
conv2d_17 (Conv2D)           (None, 72, 72, 64)        36928     
_________________________________________________________________
max_pooling2d_17 (MaxPooling (None, 36, 36, 64)        0         
_________________________________________________________________
conv2d_18 (Conv2D)           (None, 34, 34, 128)       73856     
_________________________________________________________________
max_pooling2d_18 (MaxPooling (None, 17, 17, 128)       0         
_________________________________________________________________
conv2d_19 (Conv2D)           (None, 15, 15, 128)       147584    
_________________________________________________________________
max_pooling2d_19 (MaxPooling (None, 7, 7, 128)         0         
_________________________________________________________________
flatten_4 (Flatten)          (None, 6272)              0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 6272)              0         
_________________________________________________________________
dense_8 (Dense)              (None, 512)               3211776   
_________________________________________________________________
dense_9 (Dense)              (None, 3)                 1539      
=================================================================
Total params: 3,473,475
Trainable params: 3,473,475
Non-trainable params: 0
_________________________________________________________________

You can see that the convolutional layers lose two pixels on output, so the filters are stopping when their edges match the image (so the 3 x 3 filter stops with the center one pixel away from the edge of the image). Additionally, our max-pooling layers are cutting the size of the convolutional layers' output in half, so as we progress through the network the inputs are getting smaller and smaller before reaching the fully-connected layers.
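To make the arithmetic concrete: a 'valid' 3 x 3 convolution shrinks each side to input - kernel + 1, and each 2 x 2 max-pool halves it (rounding down). A small sketch tracing the four blocks from the 150-pixel input:

def after_conv(size: int, kernel: int=3) -> int:
    """Output size of a 'valid' convolution"""
    return size - kernel + 1

def after_pool(size: int, pool: int=2) -> int:
    """Output size of a max-pooling layer"""
    return size // pool

size = 150
for block in range(4):
    size = after_pool(after_conv(size))
print(size)  # 7, matching the 7 x 7 maps that feed the Flatten layer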

Compile and Fit

Now we need to compile and train the model.

Note: The metrics can change with your settings - make sure the monitor= parameter is pointing to a key in the history. If you see this in the output:

Can save best model only with val_acc available, skipping.

You might have the wrong name for your metric (here it's val_accuracy, not val_acc).
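A quick way to see which metric names are actually available (a sketch; it only works after at least one epoch of training) is to look at the keys of the history:

# these keys are the valid values for the checkpoint's monitor parameter
print(model.history.history.keys())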

model.compile(loss = 'categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
MODELS = Path("~/models/rock-paper-scissors/").expanduser()
best_model = MODELS/"four-layer-cnn.hdf5"
checkpoint = tensorflow.keras.callbacks.ModelCheckpoint(
    str(best_model), monitor="val_accuracy", verbose=1, 
    save_best_only=True)

with TIMER:
    model.fit_generator(generator=train_generator,
                        epochs=25,
                        callbacks=[checkpoint],
                        validation_data = validation_generator,
                        verbose=2)
2019-08-25 15:06:17,145 graeae.timers.timer start: Started: 2019-08-25 15:06:17.145536
I0825 15:06:17.145575 139626236733248 timer.py:70] Started: 2019-08-25 15:06:17.145536
Epoch 1/25

Epoch 00001: val_accuracy improved from -inf to 0.61559, saving model to /home/athena/models/rock-paper-scissors/four-layer-cnn.hdf5
79/79 - 15s - loss: 1.1174 - accuracy: 0.3996 - val_loss: 0.8997 - val_accuracy: 0.6156
Epoch 2/25

Epoch 00002: val_accuracy improved from 0.61559 to 0.93817, saving model to /home/athena/models/rock-paper-scissors/four-layer-cnn.hdf5
79/79 - 14s - loss: 0.8115 - accuracy: 0.6381 - val_loss: 0.2403 - val_accuracy: 0.9382
Epoch 3/25

Epoch 00003: val_accuracy improved from 0.93817 to 0.97043, saving model to /home/athena/models/rock-paper-scissors/four-layer-cnn.hdf5
79/79 - 14s - loss: 0.5604 - accuracy: 0.7750 - val_loss: 0.2333 - val_accuracy: 0.9704
Epoch 4/25

Epoch 00004: val_accuracy improved from 0.97043 to 0.98387, saving model to /home/athena/models/rock-paper-scissors/four-layer-cnn.hdf5
79/79 - 14s - loss: 0.3926 - accuracy: 0.8496 - val_loss: 0.0681 - val_accuracy: 0.9839
Epoch 5/25

Epoch 00005: val_accuracy improved from 0.98387 to 0.99194, saving model to /home/athena/models/rock-paper-scissors/four-layer-cnn.hdf5
79/79 - 14s - loss: 0.2746 - accuracy: 0.8925 - val_loss: 0.0395 - val_accuracy: 0.9919
Epoch 6/25

Epoch 00006: val_accuracy did not improve from 0.99194
79/79 - 14s - loss: 0.2018 - accuracy: 0.9246 - val_loss: 0.1427 - val_accuracy: 0.9328
Epoch 7/25

Epoch 00007: val_accuracy did not improve from 0.99194
79/79 - 14s - loss: 0.2052 - accuracy: 0.9238 - val_loss: 0.4212 - val_accuracy: 0.8253
Epoch 8/25

Epoch 00008: val_accuracy did not improve from 0.99194
79/79 - 14s - loss: 0.1649 - accuracy: 0.9460 - val_loss: 0.1079 - val_accuracy: 0.9597
Epoch 9/25

Epoch 00009: val_accuracy did not improve from 0.99194
79/79 - 14s - loss: 0.1678 - accuracy: 0.9452 - val_loss: 0.0782 - val_accuracy: 0.9597
Epoch 10/25

Epoch 00010: val_accuracy did not improve from 0.99194
79/79 - 14s - loss: 0.1388 - accuracy: 0.9508 - val_loss: 0.0425 - val_accuracy: 0.9731
Epoch 11/25

Epoch 00011: val_accuracy did not improve from 0.99194
79/79 - 14s - loss: 0.1207 - accuracy: 0.9611 - val_loss: 0.0758 - val_accuracy: 0.9570
Epoch 12/25

Epoch 00012: val_accuracy did not improve from 0.99194
79/79 - 14s - loss: 0.1195 - accuracy: 0.9639 - val_loss: 0.1392 - val_accuracy: 0.9489
Epoch 13/25

Epoch 00013: val_accuracy improved from 0.99194 to 1.00000, saving model to /home/athena/models/rock-paper-scissors/four-layer-cnn.hdf5
79/79 - 14s - loss: 0.1182 - accuracy: 0.9583 - val_loss: 0.0147 - val_accuracy: 1.0000
Epoch 14/25

Epoch 00014: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0959 - accuracy: 0.9722 - val_loss: 0.1264 - val_accuracy: 0.9543
Epoch 15/25

Epoch 00015: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.1225 - accuracy: 0.9643 - val_loss: 0.1124 - val_accuracy: 0.9677
Epoch 16/25

Epoch 00016: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0959 - accuracy: 0.9706 - val_loss: 0.0773 - val_accuracy: 0.9677
Epoch 17/25

Epoch 00017: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0817 - accuracy: 0.9687 - val_loss: 0.0120 - val_accuracy: 1.0000
Epoch 18/25

Epoch 00018: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.1308 - accuracy: 0.9627 - val_loss: 0.1058 - val_accuracy: 0.9758
Epoch 19/25

Epoch 00019: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0967 - accuracy: 0.9675 - val_loss: 0.0356 - val_accuracy: 0.9866
Epoch 20/25

Epoch 00020: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0785 - accuracy: 0.9726 - val_loss: 0.0474 - val_accuracy: 0.9704
Epoch 21/25

Epoch 00021: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0962 - accuracy: 0.9710 - val_loss: 0.0774 - val_accuracy: 0.9677
Epoch 22/25

Epoch 00022: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0802 - accuracy: 0.9754 - val_loss: 0.1592 - val_accuracy: 0.9516
Epoch 23/25

Epoch 00023: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0909 - accuracy: 0.9714 - val_loss: 0.1123 - val_accuracy: 0.9382
Epoch 24/25

Epoch 00024: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0573 - accuracy: 0.9782 - val_loss: 0.0609 - val_accuracy: 0.9785
Epoch 25/25

Epoch 00025: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0860 - accuracy: 0.9778 - val_loss: 0.1106 - val_accuracy: 0.9677
2019-08-25 15:12:11,360 graeae.timers.timer end: Ended: 2019-08-25 15:12:11.360361
I0825 15:12:11.360388 139626236733248 timer.py:77] Ended: 2019-08-25 15:12:11.360361
2019-08-25 15:12:11,361 graeae.timers.timer end: Elapsed: 0:05:54.214825
I0825 15:12:11.361701 139626236733248 timer.py:78] Elapsed: 0:05:54.214825

That did surprisingly well… is it really that easy a problem?

predictor = load_model(best_model)
data = pandas.DataFrame(model.history.history)
plot = data.hvplot().opts(title="Rock, Paper, Scissors Training and Validation", width=1000, height=800)
Embed(plot=plot, file_name="training")()

Figure Missing

Looking at the validation accuracy, the model appears to start overfitting toward the end. Strangely, until that point the validation loss is lower than the training loss, and the validation accuracy is better almost throughout - perhaps because the image augmentation makes the training set harder than the (un-augmented) validation set, and because dropout is only active during training.
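
One way to probe the augmentation hypothesis (a sketch, not something from the notebook - training_path here stands in for wherever the training images were unpacked earlier) is to evaluate the trained model on the same training images with only the rescaling applied:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The same training images, but without the random augmentation, so any gap
# between training and validation loss isn't inflated by harder inputs.
plain_generator = ImageDataGenerator(rescale=1/255).flow_from_directory(
    str(training_path),  # assumed name for the training image directory
    target_size=(150, 150),
    class_mode="categorical",
    shuffle=False)
loss, accuracy = model.evaluate_generator(plain_generator)
print(f"un-augmented training loss: {loss:0.4f}, accuracy: {accuracy:0.4f}")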

End

Some Test Images

base = Path("~/test_images").expanduser()
paper = base/"Rock-paper-scissors_(paper).png"

image = matplotlib_image.imread(str(paper))
pyplot.title("Paper Test Case")
pyplot.imshow(image)
pyplot.axis('Off')
pyplot.show()

test_paper.png

classifications = dict(zip(range(3), ("Paper", "Rock", "Scissors")))
image_ = keras_image.load_img(str(paper), target_size=(150, 150))
x = keras_image.img_to_array(image_)
x = numpy.expand_dims(x, axis=0)
images = numpy.vstack([x])
classes = predictor.predict(images, batch_size=10)
print(classifications[classes.argmax()])
Paper
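
The classifications dictionary assumes the generator assigned the class indices alphabetically (Paper = 0, Rock = 1, Scissors = 2). If in doubt, the generator itself records the mapping (the directory names in the comment are my guess, so treat them as an assumption):

print(train_generator.class_indices)
# e.g. {'paper': 0, 'rock': 1, 'scissors': 2}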

base = Path("~/test_images").expanduser()
rock = base/"Rock-paper-scissors_(rock).png"

image = matplotlib_image.imread(str(rock))
pyplot.title("Rock Test Case")
pyplot.imshow(image)
pyplot.axis('Off')
pyplot.show()

test_rock.png

image_ = keras_image.load_img(str(rock), target_size=(150, 150))
x = keras_image.img_to_array(image_)
x = numpy.expand_dims(x, axis=0)
images = numpy.vstack([x])
classes = predictor.predict(images, batch_size=10)
print(classifications[classes.argmax()])
Rock

base = Path("~/test_images").expanduser()
scissors = base/"Rock-paper-scissors_(scissors).png"

image = matplotlib_image.imread(str(scissors))
pyplot.title("Scissors Test Case")
pyplot.imshow(image)
pyplot.axis('Off')
pyplot.show()

test_scissors.png

image_ = keras_image.load_img(str(scissors), target_size=(150, 150))
x = keras_image.img_to_array(image_)
x = numpy.expand_dims(x, axis=0)
images = numpy.vstack([x])
classes = predictor.predict(images, batch_size=10)
print(classifications[classes.argmax()])
Paper

What if we re-train the model - will it get better?

with TIMER:
    model.fit_generator(generator=train_generator,
                        epochs=25,
                        callbacks=[checkpoint],
                        validation_data = validation_generator,
                        verbose=2)
2019-08-25 15:21:37,706 graeae.timers.timer start: Started: 2019-08-25 15:21:37.706175
I0825 15:21:37.706199 139626236733248 timer.py:70] Started: 2019-08-25 15:21:37.706175
Epoch 1/25

Epoch 00001: val_accuracy did not improve from 1.00000
79/79 - 15s - loss: 0.0792 - accuracy: 0.9798 - val_loss: 0.1101 - val_accuracy: 0.9543
Epoch 2/25

Epoch 00002: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0691 - accuracy: 0.9798 - val_loss: 0.1004 - val_accuracy: 0.9570
Epoch 3/25

Epoch 00003: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0850 - accuracy: 0.9762 - val_loss: 0.0098 - val_accuracy: 1.0000
Epoch 4/25

Epoch 00004: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0799 - accuracy: 0.9730 - val_loss: 0.1022 - val_accuracy: 0.9409
Epoch 5/25

Epoch 00005: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0767 - accuracy: 0.9758 - val_loss: 0.1134 - val_accuracy: 0.9328
Epoch 6/25

Epoch 00006: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0747 - accuracy: 0.9833 - val_loss: 0.0815 - val_accuracy: 0.9731
Epoch 7/25

Epoch 00007: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0680 - accuracy: 0.9817 - val_loss: 0.1476 - val_accuracy: 0.9059
Epoch 8/25

Epoch 00008: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0669 - accuracy: 0.9821 - val_loss: 0.0202 - val_accuracy: 0.9866
Epoch 9/25

Epoch 00009: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0809 - accuracy: 0.9774 - val_loss: 0.3860 - val_accuracy: 0.8844
Epoch 10/25

Epoch 00010: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0583 - accuracy: 0.9817 - val_loss: 0.0504 - val_accuracy: 0.9812
Epoch 11/25

Epoch 00011: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0691 - accuracy: 0.9806 - val_loss: 0.0979 - val_accuracy: 0.9624
Epoch 12/25

Epoch 00012: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0459 - accuracy: 0.9881 - val_loss: 0.1776 - val_accuracy: 0.9167
Epoch 13/25

Epoch 00013: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0648 - accuracy: 0.9821 - val_loss: 0.0770 - val_accuracy: 0.9435
Epoch 14/25

Epoch 00014: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0549 - accuracy: 0.9825 - val_loss: 0.0075 - val_accuracy: 1.0000
Epoch 15/25

Epoch 00015: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0575 - accuracy: 0.9829 - val_loss: 0.1787 - val_accuracy: 0.9167
Epoch 16/25

Epoch 00016: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0665 - accuracy: 0.9778 - val_loss: 0.0230 - val_accuracy: 0.9866
Epoch 17/25

Epoch 00017: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0557 - accuracy: 0.9825 - val_loss: 0.0431 - val_accuracy: 0.9785
Epoch 18/25

Epoch 00018: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0628 - accuracy: 0.9817 - val_loss: 0.2121 - val_accuracy: 0.8952
Epoch 19/25

Epoch 00019: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0580 - accuracy: 0.9841 - val_loss: 0.0705 - val_accuracy: 0.9651
Epoch 20/25

Epoch 00020: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0578 - accuracy: 0.9810 - val_loss: 0.3318 - val_accuracy: 0.8925
Epoch 21/25

Epoch 00021: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0500 - accuracy: 0.9821 - val_loss: 0.2106 - val_accuracy: 0.8925
Epoch 22/25

Epoch 00022: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0520 - accuracy: 0.9829 - val_loss: 0.1040 - val_accuracy: 0.9382
Epoch 23/25

Epoch 00023: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0693 - accuracy: 0.9853 - val_loss: 0.6132 - val_accuracy: 0.8575
Epoch 24/25

Epoch 00024: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0553 - accuracy: 0.9849 - val_loss: 0.3048 - val_accuracy: 0.8817
Epoch 25/25

Epoch 00025: val_accuracy did not improve from 1.00000
79/79 - 14s - loss: 0.0376 - accuracy: 0.9877 - val_loss: 0.0121 - val_accuracy: 1.0000
2019-08-25 15:27:32,278 graeae.timers.timer end: Ended: 2019-08-25 15:27:32.278250
I0825 15:27:32.278276 139626236733248 timer.py:77] Ended: 2019-08-25 15:27:32.278250
2019-08-25 15:27:32,279 graeae.timers.timer end: Elapsed: 0:05:54.572075
I0825 15:27:32.279404 139626236733248 timer.py:78] Elapsed: 0:05:54.572075

So, the validation accuracy went up to 100% again - is it a super-classifier?

data = pandas.DataFrame(model.history.history)
plot = data.hvplot().opts(title="Rock, Paper, Scissors Re-Training and Re-Validation", width=1000, height=800)
Embed(plot=plot, file_name="re_training")()

Figure Missing

predictor = load_model(best_model)
classifications = dict(zip(range(3), ("Paper", "Rock", "Scissors")))
image_ = keras_image.load_img(str(paper), target_size=(150, 150))
x = keras_image.img_to_array(image_)
x = numpy.expand_dims(x, axis=0)
images = numpy.vstack([x])
classes = model.predict(images, batch_size=10)
print(classifications[classes.argmax()])
Rock

classifications = dict(zip(range(3), ("Paper", "Rock", "Scissors")))
image_ = keras_image.load_img(str(rock), target_size=(150, 150))
x = keras_image.img_to_array(image_)
x = numpy.expand_dims(x, axis=0)
images = numpy.vstack([x])
classes = model.predict(images, batch_size=10)
print(classifications[classes.argmax()])
Rock

classifications = dict(zip(range(3), ("Paper", "Rock", "Scissors")))
image_ = keras_image.load_img(str(scissors), target_size=(150, 150))
x = keras_image.img_to_array(image_)
x = numpy.expand_dims(x, axis=0)
images = numpy.vstack([x])
classes = model.predict(images, batch_size=10)
print(classifications[classes.argmax()])
Paper

I don't have a large test set, but just from these three predictions it looks like the re-trained model got worse (note that these predictions use the in-memory model after re-training, not the re-loaded best checkpoint).
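
A more quantitative comparison (just a sketch) would be to evaluate both the re-loaded best checkpoint and the in-memory re-trained model on the validation generator rather than eyeballing three images:

best = load_model(str(best_model))
for name, candidate in (("best checkpoint", best), ("re-trained model", model)):
    loss, accuracy = candidate.evaluate_generator(validation_generator)
    print(f"{name}: loss={loss:0.4f}, accuracy={accuracy:0.4f}")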

Sources