He Used Sarcasm
Beginning
This is a look at fitting a model to detect sarcasm in headlines, using a JSON dataset provided by Laurence Moroney (https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json).
Imports
Python
from argparse import Namespace
from functools import partial
from pathlib import Path
from pprint import pprint
from urllib.parse import urlparse
import json
PyPi
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import hvplot.pandas
import pandas
import tensorflow
Other
from graeae import (
CountPercentage,
EmbedHoloviews,
TextDownloader,
Timer
)
Set Up
The Timer
TIMER = Timer()
The Plotting
SLUG = "he-used-sarcasm"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")
The Data
URL = "https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json"
path = Path("~/data/datasets/text/sarcasm/sarcasm.json").expanduser()
downloader = TextDownloader(URL, path)
with TIMER:
    data = json.loads(downloader.download)
2019-09-22 15:05:27,302 graeae.timers.timer start: Started: 2019-09-22 15:05:27.301225
2019-09-22 15:05:27,306 TextDownloader download: /home/hades/data/datasets/text/sarcasm/sarcasm.json exists, opening it
2019-09-22 15:05:27,367 graeae.timers.timer end: Ended: 2019-09-22 15:05:27.367036
2019-09-22 15:05:27,369 graeae.timers.timer end: Elapsed: 0:00:00.065811
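The TextDownloader, Timer, and CountPercentage classes come from my graeae helper package. If you don't have it, a minimal stand-in for the download step using just the standard library might look something like this (download_json is a made-up helper, purely for illustration):

# A minimal stand-in for TextDownloader (an illustrative sketch, not the graeae implementation).
from urllib.request import urlopen

def download_json(url: str, path: Path) -> list:
    """Download the sarcasm JSON to `path` if it isn't cached, then load it."""
    if not path.is_file():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(urlopen(url).read().decode("utf-8"))
    return json.loads(path.read_text())

# data = download_json(URL, path)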
Middle
Looking At the Data
pprint(data[0])
{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
 'headline': "former versace store clerk sues over secret 'black code' for "
             'minority shoppers',
 'is_sarcastic': 0}
So our data is a list of dictionaries, each with three keys: the source of the article, the headline of the article, and whether or not the headline is sarcastic. There's no citation in the original notebook, but it looks like it might be this one on GitHub.
data = pandas.DataFrame(data)
data.loc[:, "site"] = data.article_link.apply(lambda link: urlparse(link).netloc)
CountPercentage(data.site, value_label="Site")()
Site | Count | Percent (%) |
---|---|---|
www.huffingtonpost.com | 14403 | 53.93 |
www.theonion.com | 5811 | 21.76 |
local.theonion.com | 2852 | 10.68 |
politics.theonion.com | 1767 | 6.62 |
entertainment.theonion.com | 1194 | 4.47 |
www.huffingtonpost.comhttp: | 503 | 1.88 |
sports.theonion.com | 100 | 0.37 |
www.huffingtonpost.comhttps: | 79 | 0.30 |
So, it looks like there are some problems with the URLs. I don't think that's important, but maybe I can clean them up a little anyway.
print(
    data[
        data.site.str.contains("www.huffingtonpost.comhttp")
    ].article_link.value_counts())
https://www.huffingtonpost.comhttp://nymag.com/daily/intelligencer/2016/05/hillary-clinton-candidacy.html    2
https://www.huffingtonpost.comhttps://www.facebook.com/HuffPostQueerVoices/videos/1153919084666530/    1
https://www.huffingtonpost.comhttp://live.huffingtonpost.com/r/segment/chris-meloni-star-underground/56d8584f99ec6dca3d00000a    1
https://www.huffingtonpost.comhttp://www.thestreet.com/story/13223501/1/post-retirement-work-may-not-save-your-golden-years.html    1
https://www.huffingtonpost.comhttps://www.facebook.com/HuffPostEntertainment/    1
..
https://www.huffingtonpost.comhttp://nymag.com/thecut/2015/10/first-legal-abortionists-tell-their-stories.html?mid=twitter_nymag    1
https://www.huffingtonpost.comhttp://www.tampabay.com/blogs/the-buzz-florida-politics/marco-rubio-warming-up-to-donald-trump/2275308    1
https://www.huffingtonpost.comhttp://live.huffingtonpost.com/r/segment/porn-to-pay-for-college/55aeadf62b8c2a2f6f000193    1
https://www.huffingtonpost.comhttps://www.thedodo.com/dog-mouth-taped-shut-facebook-1481874724.html    1
https://www.huffingtonpost.comhttps://www.washingtonpost.com/politics/ben-carson-to-tell-supporters-he-sees-no-path-forward-for-campaign/2016/03/02/d6bef352-d9b3-11e5-891a-4ed04f4213e8_story.html    1
Name: article_link, Length: 581, dtype: int64
That's kind of odd. I don't know what it means; maybe the Huffington Post was citing other sites? I went to check the GitHub dataset I mentioned, but it's actually much larger than this one, so I don't know whether it's really the source or not.
prefixes = ("www.huffingtonpost.comhttp:", "www.huffingtonpost.comhttps:")
for prefix in prefixes:
data.loc[:, "site"] = data.site.str.replace(
prefix,
"www.huffingtonpost.com")
prefixes = ("local.theonion.com",
"politics.theonion.com",
"entertainment.theonion.com",
"sports.theonion.com")
for prefix in prefixes:
data.loc[:, "site"] = data.site.str.replace(prefix,
"www.theonion.com")
counter = CountPercentage(data.site, value_label="Site")
counter()
Site | Count | Percent (%) |
---|---|---|
www.huffingtonpost.com | 14985 | 56.10 |
www.theonion.com | 11724 | 43.90 |
plot = counter.table.hvplot.bar(x="Site", y="Count").opts(
    title="Distribution by Site",
    width=1000,
    height=800)
Embed(plot=plot, file_name="site_distribution")()
counter = CountPercentage(data.is_sarcastic, value_label="Is Sarcastic")
counter()
Is Sarcastic | Count | Percent (%) |
---|---|---|
0 | 14,985 | 56.10 |
1 | 11,724 | 43.90 |
Given that the counts match, I'm assuming anything from the Huffington Post is labeled as not sarcastic and anything from The Onion is sarcastic.
assert all(data[data.site=="www.theonion.com"].is_sarcastic)
assert not any(data[data.site=="www.huffingtonpost.com"].is_sarcastic)
Set Up the Tokenizing and Training Data
print(f"{len(data):,}")
26,709
Text = Namespace(
    vocabulary_size = 1000,
    embedding_dim = 16,
    max_length = 120,
    truncating_type='post',
    padding_type='post',
    out_of_vocabulary_tok = "<OOV>",
)

# the size is actually the default for train_test_split
Training = Namespace(
    size = 0.75,
    epochs = 50,
    verbosity = 2,
)
The Training and Testing Data
x_train, x_test, y_train, y_test = train_test_split(
    data.headline, data.is_sarcastic, train_size=Training.size,
)
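A quick sanity check on the split sizes (with the 0.75 train fraction this works out to 20,031 training and 6,678 testing headlines):

# The two pieces should add back up to the full data set.
print(f"Training: {len(x_train):,}\tTesting: {len(x_test):,}")
assert len(x_train) + len(x_test) == len(data)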
The Tokenizer
tokenizer = Tokenizer(num_words=Text.vocabulary_size,
                      oov_token=Text.out_of_vocabulary_tok)
print(tokenizer.__doc__)
Text tokenization utility class. This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf... # Arguments num_words: the maximum number of words to keep, based on word frequency. Only the most common `num_words-1` words will be kept. filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the `'` character. lower: boolean. Whether to convert the texts to lowercase. split: str. Separator for word splitting. char_level: if True, every character will be treated as a token. oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls By default, all punctuation is removed, turning the texts into space-separated sequences of words (words maybe include the `'` character). These sequences are then split into lists of tokens. They will then be indexed or vectorized. `0` is a reserved index that won't be assigned to any word.
Now that we have a tokenizer we can tokenize our training headlines.
help(tokenizer.fit_on_texts)
Help on method fit_on_texts in module keras_preprocessing.text: fit_on_texts(texts) method of keras_preprocessing.text.Tokenizer instance Updates internal vocabulary based on a list of texts. In the case where texts contains lists, we assume each entry of the lists to be a token. Required before using `texts_to_sequences` or `texts_to_matrix`. # Arguments texts: can be a list of strings, a generator of strings (for memory-efficiency), or a list of list of strings.
tokenizer.fit_on_texts(x_train)
Now that we've fit the tokenizer on the headlines, we can get the word index, a dict mapping each word to its index.
word_index = tokenizer.word_index
Note that the tokenizer doesn't remove stop-words.
print("the" in word_index)
True
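Just to get a feel for the mapping, here's a peek at the lowest indices (the exact entries depend on the train/test split, so this is only illustrative):

# The lowest indices go to the OOV token and the most frequent words.
print(sorted(word_index.items(), key=lambda item: item[1])[:10])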
Now we'll convert the training headlines to sequences of numbers.
help(tokenizer.texts_to_sequences)
Help on method texts_to_sequences in module keras_preprocessing.text: texts_to_sequences(texts) method of keras_preprocessing.text.Tokenizer instance Transforms each text in texts to a sequence of integers. Only top `num_words-1` most frequent words will be taken into account. Only words known by the tokenizer will be taken into account. # Arguments texts: A list of texts (strings). # Returns A list of sequences.
We're also going to have to pad them to make them the same length.
help(pad_sequences)
Help on function pad_sequences in module keras_preprocessing.sequence: pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0) Pads sequences to the same length. This function transforms a list of `num_samples` sequences (lists of integers) into a 2D Numpy array of shape `(num_samples, num_timesteps)`. `num_timesteps` is either the `maxlen` argument if provided, or the length of the longest sequence otherwise. Sequences that are shorter than `num_timesteps` are padded with `value` at the end. Sequences longer than `num_timesteps` are truncated so that they fit the desired length. The position where padding or truncation happens is determined by the arguments `padding` and `truncating`, respectively. Pre-padding is the default. # Arguments sequences: List of lists, where each element is a sequence. maxlen: Int, maximum length of all sequences. dtype: Type of the output sequences. To pad sequences with variable length strings, you can use `object`. padding: String, 'pre' or 'post': pad either before or after each sequence. truncating: String, 'pre' or 'post': remove values from sequences larger than `maxlen`, either at the beginning or at the end of the sequences. value: Float or String, padding value. # Returns x: Numpy array with shape `(len(sequences), maxlen)` # Raises ValueError: In case of invalid values for `truncating` or `padding`, or in case of invalid shape for a `sequences` entry.
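To make the padding and truncating behavior concrete, here's a toy example with made-up sequences (not from our data), using the same post settings we'll use below:

# One sequence shorter and one longer than maxlen=5.
example = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]]
print(pad_sequences(example, maxlen=5, padding="post", truncating="post"))

The short sequence gets zeros appended and the long one loses its last two entries.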
training_sequences = tokenizer.texts_to_sequences(x_train)
training_padded = pad_sequences(training_sequences,
                                maxlen=Text.max_length,
                                padding=Text.padding_type,
                                truncating=Text.truncating_type)

testing_sequences = tokenizer.texts_to_sequences(x_test)
testing_padded = pad_sequences(testing_sequences,
                               maxlen=Text.max_length,
                               padding=Text.padding_type,
                               truncating=Text.truncating_type)
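The padded arrays should now all share the same second dimension, Text.max_length:

# Both shapes should be (number of headlines, 120).
print(training_padded.shape)
print(testing_padded.shape)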
Build The Model
We're going to use a convolutional neural network to try to classify our headlines as sarcastic or not-sarcastic.
It's a Sequence of Layers
print(tensorflow.keras.Sequential.__doc__)
Linear stack of layers. Arguments: layers: list of layers to add to the model. Example: ```python # Optionally, the first layer can receive an `input_shape` argument: model = Sequential() model.add(Dense(32, input_shape=(500,))) # Afterwards, we do automatic shape inference: model.add(Dense(32)) # This is identical to the following: model = Sequential() model.add(Dense(32, input_dim=500)) # And to the following: model = Sequential() model.add(Dense(32, batch_input_shape=(None, 500))) # Note that you can also omit the `input_shape` argument: # In that case the model gets built the first time you call `fit` (or other # training and evaluation methods). model = Sequential() model.add(Dense(32)) model.add(Dense(32)) model.compile(optimizer=optimizer, loss=loss) # This builds the model for the first time: model.fit(x, y, batch_size=32, epochs=10) # Note that when using this delayed-build pattern (no input shape specified), # the model doesn't have any weights until the first call # to a training/evaluation method (since it isn't yet built): model = Sequential() model.add(Dense(32)) model.add(Dense(32)) model.weights # returns [] # Whereas if you specify the input shape, the model gets built continuously # as you are adding layers: model = Sequential() model.add(Dense(32, input_shape=(500,))) model.add(Dense(32)) model.weights # returns list of length 4 # When using the delayed-build pattern (no input shape specified), you can # choose to manually build your model by calling `build(batch_input_shape)`: model = Sequential() model.add(Dense(32)) model.add(Dense(32)) model.build((None, 500)) model.weights # returns list of length 4 ```
Start With An Embedding Layer
print(tensorflow.keras.layers.Embedding.__doc__)
Turns positive integers (indexes) into dense vectors of fixed size. e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]` This layer can only be used as the first layer in a model. Example: ```python model = Sequential() model.add(Embedding(1000, 64, input_length=10)) # the model will take as input an integer matrix of size (batch, # input_length). # the largest integer (i.e. word index) in the input should be no larger # than 999 (vocabulary size). # now model.output_shape == (None, 10, 64), where None is the batch # dimension. input_array = np.random.randint(1000, size=(32, 10)) model.compile('rmsprop', 'mse') output_array = model.predict(input_array) assert output_array.shape == (32, 10, 64) ``` Arguments: input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1. output_dim: int >= 0. Dimension of the dense embedding. embeddings_initializer: Initializer for the `embeddings` matrix. embeddings_regularizer: Regularizer function applied to the `embeddings` matrix. embeddings_constraint: Constraint function applied to the `embeddings` matrix. mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is `True` then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1). input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect `Flatten` then `Dense` layers upstream (without it, the shape of the dense outputs cannot be computed). Input shape: 2D tensor with shape: `(batch_size, input_length)`. Output shape: 3D tensor with shape: `(batch_size, input_length, output_dim)`.
The Convolutional Layer
print(tensorflow.keras.layers.Conv1D.__doc__)
1D convolution layer (e.g. temporal convolution). This layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a tensor of outputs. If `use_bias` is True, a bias vector is created and added to the outputs. Finally, if `activation` is not `None`, it is applied to the outputs as well. When using this layer as the first layer in a model, provide an `input_shape` argument (tuple of integers or `None`, e.g. `(10, 128)` for sequences of 10 vectors of 128-dimensional vectors, or `(None, 128)` for variable-length sequences of 128-dimensional vectors. Arguments: filters: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution). kernel_size: An integer or tuple/list of a single integer, specifying the length of the 1D convolution window. strides: An integer or tuple/list of a single integer, specifying the stride length of the convolution. Specifying any stride value != 1 is incompatible with specifying any `dilation_rate` value != 1. padding: One of `"valid"`, `"causal"` or `"same"` (case-insensitive). `"causal"` results in causal (dilated) convolutions, e.g. output[t] does not depend on input[t+1:]. Useful when modeling temporal data where the model should not violate the temporal order. See [WaveNet: A Generative Model for Raw Audio, section 2.1](https://arxiv.org/abs/1609.03499). data_format: A string, one of `channels_last` (default) or `channels_first`. dilation_rate: an integer or tuple/list of a single integer, specifying the dilation rate to use for dilated convolution. Currently, specifying any `dilation_rate` value != 1 is incompatible with specifying any `strides` value != 1. activation: Activation function to use. If you don't specify anything, no activation is applied (ie. "linear" activation: `a(x) = x`). use_bias: Boolean, whether the layer uses a bias vector. kernel_initializer: Initializer for the `kernel` weights matrix. bias_initializer: Initializer for the bias vector. kernel_regularizer: Regularizer function applied to the `kernel` weights matrix. bias_regularizer: Regularizer function applied to the bias vector. activity_regularizer: Regularizer function applied to the output of the layer (its "activation").. kernel_constraint: Constraint function applied to the kernel matrix. bias_constraint: Constraint function applied to the bias vector. Examples: ```python # Small convolutional model for 128-length vectors with 6 timesteps # model.input_shape == (None, 6, 128) model = Sequential() model.add(Conv1D(32, 3, activation='relu', input_shape=(6, 128))) # now: model.output_shape == (None, 4, 32) ``` Input shape: 3D tensor with shape: `(batch_size, steps, input_dim)` Output shape: 3D tensor with shape: `(batch_size, new_steps, filters)` `steps` value might have changed due to padding or strides.
A Pooling Layer
print(tensorflow.keras.layers.GlobalMaxPooling1D.__doc__)
Global max pooling operation for temporal data. Arguments: data_format: A string, one of `channels_last` (default) or `channels_first`. The ordering of the dimensions in the inputs. `channels_last` corresponds to inputs with shape `(batch, steps, features)` while `channels_first` corresponds to inputs with shape `(batch, features, steps)`. Input shape: - If `data_format='channels_last'`: 3D tensor with shape: `(batch_size, steps, features)` - If `data_format='channels_first'`: 3D tensor with shape: `(batch_size, features, steps)` Output shape: 2D tensor with shape `(batch_size, features)`.
The Fully-Connected Layers
Finally our output layers.
print(tensorflow.keras.layers.Dense.__doc__)
Just your regular densely-connected NN layer. `Dense` implements the operation: `output = activation(dot(input, kernel) + bias)` where `activation` is the element-wise activation function passed as the `activation` argument, `kernel` is a weights matrix created by the layer, and `bias` is a bias vector created by the layer (only applicable if `use_bias` is `True`). Note: If the input to the layer has a rank greater than 2, then it is flattened prior to the initial dot product with `kernel`. Example: ```python # as first layer in a sequential model: model = Sequential() model.add(Dense(32, input_shape=(16,))) # now the model will take as input arrays of shape (*, 16) # and output arrays of shape (*, 32) # after the first layer, you don't need to specify # the size of the input anymore: model.add(Dense(32)) ``` Arguments: units: Positive integer, dimensionality of the output space. activation: Activation function to use. If you don't specify anything, no activation is applied (ie. "linear" activation: `a(x) = x`). use_bias: Boolean, whether the layer uses a bias vector. kernel_initializer: Initializer for the `kernel` weights matrix. bias_initializer: Initializer for the bias vector. kernel_regularizer: Regularizer function applied to the `kernel` weights matrix. bias_regularizer: Regularizer function applied to the bias vector. activity_regularizer: Regularizer function applied to the output of the layer (its "activation").. kernel_constraint: Constraint function applied to the `kernel` weights matrix. bias_constraint: Constraint function applied to the bias vector. Input shape: N-D tensor with shape: `(batch_size, ..., input_dim)`. The most common situation would be a 2D input with shape `(batch_size, input_dim)`. Output shape: N-D tensor with shape: `(batch_size, ..., units)`. For instance, for a 2D input with shape `(batch_size, input_dim)`, the output would have shape `(batch_size, units)`.
Build It
I originally added the layers using the model.add method, but when I tried to train it the output said the layers didn't have gradients and it never improved… I'll have to look into that, but in the meantime, passing all the layers to the Sequential constructor seems to work.
model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(
        input_dim=Text.vocabulary_size,
        output_dim=Text.embedding_dim,
        input_length=Text.max_length),
    tensorflow.keras.layers.Conv1D(filters=128,
                                   kernel_size=5,
                                   activation='relu'),
    tensorflow.keras.layers.GlobalMaxPooling1D(),
    tensorflow.keras.layers.Dense(24, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
Compile It
print(model.compile.__doc__)
Configures the model for training. Arguments: optimizer: String (name of optimizer) or optimizer instance. See `tf.keras.optimizers`. loss: String (name of objective function), objective function or `tf.losses.Loss` instance. See `tf.losses`. If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses. metrics: List of metrics to be evaluated by the model during training and testing. Typically you will use `metrics=['accuracy']`. To specify different metrics for different outputs of a multi-output model, you could also pass a dictionary, such as `metrics={'output_a': 'accuracy', 'output_b': ['accuracy', 'mse']}`. You can also pass a list (len = len(outputs)) of lists of metrics such as `metrics=[['accuracy'], ['accuracy', 'mse']]` or `metrics=['accuracy', ['accuracy', 'mse']]`. loss_weights: Optional list or dictionary specifying scalar coefficients (Python floats) to weight the loss contributions of different model outputs. The loss value that will be minimized by the model will then be the *weighted sum* of all individual losses, weighted by the `loss_weights` coefficients. If a list, it is expected to have a 1:1 mapping to the model's outputs. If a tensor, it is expected to map output names (strings) to scalar coefficients. sample_weight_mode: If you need to do timestep-wise sample weighting (2D weights), set this to `"temporal"`. `None` defaults to sample-wise weights (1D). If the model has multiple outputs, you can use a different `sample_weight_mode` on each output by passing a dictionary or a list of modes. weighted_metrics: List of metrics to be evaluated and weighted by sample_weight or class_weight during training and testing. target_tensors: By default, Keras will create placeholders for the model's target, which will be fed with the target data during training. If instead you would like to use your own target tensors (in turn, Keras will not expect external Numpy data for these targets at training time), you can specify them via the `target_tensors` argument. It can be a single tensor (for a single-output model), a list of tensors, or a dict mapping output names to target tensors. distribute: NOT SUPPORTED IN TF 2.0, please create and compile the model under distribution strategy scope instead of passing it to compile. **kwargs: Any additional arguments. Raises: ValueError: In case of invalid arguments for `optimizer`, `loss`, `metrics` or `sample_weight_mode`.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_2 (Embedding) (None, 120, 16) 16000 _________________________________________________________________ conv1d_1 (Conv1D) (None, 116, 128) 10368 _________________________________________________________________ global_max_pooling1d_2 (Glob (None, 128) 0 _________________________________________________________________ dense_4 (Dense) (None, 24) 3096 _________________________________________________________________ dense_5 (Dense) (None, 1) 25 ================================================================= Total params: 29,489 Trainable params: 29,489 Non-trainable params: 0 _________________________________________________________________ None
Train It
print(model.fit.__doc__)
Trains the model for a fixed number of epochs (iterations on a dataset). Arguments: x: Input data. It could be: - A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs). - A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs). - A dict mapping input names to the corresponding array/tensors, if the model has named inputs. - A `tf.data` dataset. Should return a tuple of either `(inputs, targets)` or `(inputs, targets, sample_weights)`. - A generator or `keras.utils.Sequence` returning `(inputs, targets)` or `(inputs, targets, sample weights)`. y: Target data. Like the input data `x`, it could be either Numpy array(s) or TensorFlow tensor(s). It should be consistent with `x` (you cannot have Numpy inputs and tensor targets, or inversely). If `x` is a dataset, generator, or `keras.utils.Sequence` instance, `y` should not be specified (since targets will be obtained from `x`). batch_size: Integer or `None`. Number of samples per gradient update. If unspecified, `batch_size` will default to 32. Do not specify the `batch_size` if your data is in the form of symbolic tensors, datasets, generators, or `keras.utils.Sequence` instances (since they generate batches). epochs: Integer. Number of epochs to train the model. An epoch is an iteration over the entire `x` and `y` data provided. Note that in conjunction with `initial_epoch`, `epochs` is to be understood as "final epoch". The model is not trained for a number of iterations given by `epochs`, but merely until the epoch of index `epochs` is reached. verbose: 0, 1, or 2. Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. Note that the progress bar is not particularly useful when logged to a file, so verbose=2 is recommended when not running interactively (eg, in a production environment). callbacks: List of `keras.callbacks.Callback` instances. List of callbacks to apply during training. See `tf.keras.callbacks`. validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the `x` and `y` data provided, before shuffling. This argument is not supported when `x` is a dataset, generator or `keras.utils.Sequence` instance. validation_data: Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. `validation_data` will override `validation_split`. `validation_data` could be: - tuple `(x_val, y_val)` of Numpy arrays or tensors - tuple `(x_val, y_val, val_sample_weights)` of Numpy arrays - dataset For the first two cases, `batch_size` must be provided. For the last case, `validation_steps` must be provided. shuffle: Boolean (whether to shuffle the training data before each epoch) or str (for 'batch'). 'batch' is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when `steps_per_epoch` is not `None`. class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class. 
sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape `(samples, sequence_length)`, to apply a different weight to every timestep of every sample. In this case you should make sure to specify `sample_weight_mode="temporal"` in `compile()`. This argument is not supported when `x` is a dataset, generator, or `keras.utils.Sequence` instance, instead provide the sample_weights as the third element of `x`. initial_epoch: Integer. Epoch at which to start training (useful for resuming a previous training run). steps_per_epoch: Integer or `None`. Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default `None` is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a `tf.data` dataset, and 'steps_per_epoch' is None, the epoch will run until the input dataset is exhausted. This argument is not supported with array inputs. validation_steps: Only relevant if `validation_data` is provided and is a `tf.data` dataset. Total number of steps (batches of samples) to draw before stopping when performing validation at the end of every epoch. If validation_data is a `tf.data` dataset and 'validation_steps' is None, validation will run until the `validation_data` dataset is exhausted. validation_freq: Only relevant if validation data is provided. Integer or `collections_abc.Container` instance (e.g. list, tuple, etc.). If an integer, specifies how many training epochs to run before a new validation run is performed, e.g. `validation_freq=2` runs validation every 2 epochs. If a Container, specifies the epochs on which to run validation, e.g. `validation_freq=[1, 2, 10]` runs validation at the end of the 1st, 2nd, and 10th epochs. max_queue_size: Integer. Used for generator or `keras.utils.Sequence` input only. Maximum size for the generator queue. If unspecified, `max_queue_size` will default to 10. workers: Integer. Used for generator or `keras.utils.Sequence` input only. Maximum number of processes to spin up when using process-based threading. If unspecified, `workers` will default to 1. If 0, will execute the generator on the main thread. use_multiprocessing: Boolean. Used for generator or `keras.utils.Sequence` input only. If `True`, use process-based threading. If unspecified, `use_multiprocessing` will default to `False`. Note that because this implementation relies on multiprocessing, you should not pass non-picklable arguments to the generator as they can't be passed easily to children processes. **kwargs: Used for backwards compatibility. Returns: A `History` object. Its `History.history` attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable). Raises: RuntimeError: If the model was never compiled. ValueError: In case of mismatch between the provided input data and what the model expects.
with TIMER:
    history = model.fit(training_padded,
                        y_train.values,
                        epochs=Training.epochs,
                        validation_data=(testing_padded, y_test.values),
                        verbose=Training.verbosity)
2019-09-22 16:30:20,369 graeae.timers.timer start: Started: 2019-09-22 16:30:20.369886 I0922 16:30:20.369913 139873020925760 timer.py:70] Started: 2019-09-22 16:30:20.369886 Train on 20031 samples, validate on 6678 samples Epoch 1/50 20031/20031 - 5s - loss: 0.4741 - accuracy: 0.7623 - val_loss: 0.4033 - val_accuracy: 0.8146 Epoch 2/50 20031/20031 - 4s - loss: 0.3663 - accuracy: 0.8366 - val_loss: 0.3980 - val_accuracy: 0.8196 Epoch 3/50 20031/20031 - 4s - loss: 0.3306 - accuracy: 0.8554 - val_loss: 0.3909 - val_accuracy: 0.8240 Epoch 4/50 20031/20031 - 4s - loss: 0.2990 - accuracy: 0.8721 - val_loss: 0.4148 - val_accuracy: 0.8179 Epoch 5/50 20031/20031 - 4s - loss: 0.2697 - accuracy: 0.8867 - val_loss: 0.4050 - val_accuracy: 0.8282 Epoch 6/50 20031/20031 - 4s - loss: 0.2406 - accuracy: 0.9003 - val_loss: 0.4291 - val_accuracy: 0.8212 Epoch 7/50 20031/20031 - 4s - loss: 0.2080 - accuracy: 0.9165 - val_loss: 0.4650 - val_accuracy: 0.8181 Epoch 8/50 20031/20031 - 4s - loss: 0.1824 - accuracy: 0.9272 - val_loss: 0.5053 - val_accuracy: 0.8130 Epoch 9/50 20031/20031 - 4s - loss: 0.1559 - accuracy: 0.9393 - val_loss: 0.5389 - val_accuracy: 0.8065 Epoch 10/50 20031/20031 - 4s - loss: 0.1325 - accuracy: 0.9498 - val_loss: 0.6213 - val_accuracy: 0.8044 Epoch 11/50 20031/20031 - 4s - loss: 0.1104 - accuracy: 0.9599 - val_loss: 0.6902 - val_accuracy: 0.8034 Epoch 12/50 20031/20031 - 4s - loss: 0.0966 - accuracy: 0.9646 - val_loss: 0.7437 - val_accuracy: 0.8035 Epoch 13/50 20031/20031 - 4s - loss: 0.0848 - accuracy: 0.9689 - val_loss: 0.8285 - val_accuracy: 0.7954 Epoch 14/50 20031/20031 - 4s - loss: 0.0693 - accuracy: 0.9753 - val_loss: 0.9121 - val_accuracy: 0.7934 Epoch 15/50 20031/20031 - 4s - loss: 0.0608 - accuracy: 0.9777 - val_loss: 1.0783 - val_accuracy: 0.7931 Epoch 16/50 20031/20031 - 4s - loss: 0.0529 - accuracy: 0.9810 - val_loss: 1.0620 - val_accuracy: 0.7889 Epoch 17/50 20031/20031 - 4s - loss: 0.0506 - accuracy: 0.9819 - val_loss: 1.2497 - val_accuracy: 0.7889 Epoch 18/50 20031/20031 - 4s - loss: 0.0471 - accuracy: 0.9821 - val_loss: 1.2518 - val_accuracy: 0.7963 Epoch 19/50 20031/20031 - 4s - loss: 0.0457 - accuracy: 0.9819 - val_loss: 1.3492 - val_accuracy: 0.7917 Epoch 20/50 20031/20031 - 4s - loss: 0.0392 - accuracy: 0.9851 - val_loss: 1.3702 - val_accuracy: 0.7948 Epoch 21/50 20031/20031 - 4s - loss: 0.0357 - accuracy: 0.9860 - val_loss: 1.4300 - val_accuracy: 0.7948 Epoch 22/50 20031/20031 - 4s - loss: 0.0341 - accuracy: 0.9864 - val_loss: 1.5654 - val_accuracy: 0.7889 Epoch 23/50 20031/20031 - 4s - loss: 0.0360 - accuracy: 0.9860 - val_loss: 1.5615 - val_accuracy: 0.7951 Epoch 24/50 20031/20031 - 4s - loss: 0.0307 - accuracy: 0.9872 - val_loss: 1.6964 - val_accuracy: 0.7953 Epoch 25/50 20031/20031 - 4s - loss: 0.0283 - accuracy: 0.9893 - val_loss: 1.6917 - val_accuracy: 0.7920 Epoch 26/50 20031/20031 - 4s - loss: 0.0365 - accuracy: 0.9850 - val_loss: 1.6935 - val_accuracy: 0.7944 Epoch 27/50 20031/20031 - 4s - loss: 0.0342 - accuracy: 0.9851 - val_loss: 1.7912 - val_accuracy: 0.7853 Epoch 28/50 20031/20031 - 4s - loss: 0.0301 - accuracy: 0.9879 - val_loss: 1.8194 - val_accuracy: 0.7887 Epoch 29/50 20031/20031 - 4s - loss: 0.0254 - accuracy: 0.9887 - val_loss: 1.9231 - val_accuracy: 0.7922 Epoch 30/50 20031/20031 - 4s - loss: 0.0216 - accuracy: 0.9910 - val_loss: 1.9480 - val_accuracy: 0.7914 Epoch 31/50 20031/20031 - 4s - loss: 0.0243 - accuracy: 0.9895 - val_loss: 1.9487 - val_accuracy: 0.7847 Epoch 32/50 20031/20031 - 4s - loss: 0.0241 - accuracy: 0.9891 - val_loss: 2.0333 - 
val_accuracy: 0.7893 Epoch 33/50 20031/20031 - 4s - loss: 0.0334 - accuracy: 0.9863 - val_loss: 1.9498 - val_accuracy: 0.7937 Epoch 34/50 20031/20031 - 4s - loss: 0.0318 - accuracy: 0.9873 - val_loss: 2.0181 - val_accuracy: 0.7942 Epoch 35/50 20031/20031 - 4s - loss: 0.0273 - accuracy: 0.9882 - val_loss: 2.0254 - val_accuracy: 0.7913 Epoch 36/50 20031/20031 - 4s - loss: 0.0236 - accuracy: 0.9897 - val_loss: 2.1159 - val_accuracy: 0.7937 Epoch 37/50 20031/20031 - 4s - loss: 0.0204 - accuracy: 0.9905 - val_loss: 2.1018 - val_accuracy: 0.7950 Epoch 38/50 20031/20031 - 4s - loss: 0.0187 - accuracy: 0.9916 - val_loss: 2.1939 - val_accuracy: 0.7947 Epoch 39/50 20031/20031 - 4s - loss: 0.0253 - accuracy: 0.9888 - val_loss: 2.2090 - val_accuracy: 0.7920 Epoch 40/50 20031/20031 - 4s - loss: 0.0270 - accuracy: 0.9889 - val_loss: 2.2737 - val_accuracy: 0.7862 Epoch 41/50 20031/20031 - 4s - loss: 0.0234 - accuracy: 0.9893 - val_loss: 2.2559 - val_accuracy: 0.7926 Epoch 42/50 20031/20031 - 4s - loss: 0.0223 - accuracy: 0.9902 - val_loss: 2.3223 - val_accuracy: 0.7884 Epoch 43/50 20031/20031 - 4s - loss: 0.0251 - accuracy: 0.9897 - val_loss: 2.2547 - val_accuracy: 0.7863 Epoch 44/50 20031/20031 - 4s - loss: 0.0209 - accuracy: 0.9900 - val_loss: 2.3917 - val_accuracy: 0.7823 Epoch 45/50 20031/20031 - 4s - loss: 0.0245 - accuracy: 0.9889 - val_loss: 2.4222 - val_accuracy: 0.7881 Epoch 46/50 20031/20031 - 4s - loss: 0.0215 - accuracy: 0.9901 - val_loss: 2.4135 - val_accuracy: 0.7869 Epoch 47/50 20031/20031 - 4s - loss: 0.0229 - accuracy: 0.9896 - val_loss: 2.3287 - val_accuracy: 0.7823 Epoch 48/50 20031/20031 - 4s - loss: 0.0191 - accuracy: 0.9918 - val_loss: 2.4639 - val_accuracy: 0.7845 Epoch 49/50 20031/20031 - 4s - loss: 0.0183 - accuracy: 0.9911 - val_loss: 2.6068 - val_accuracy: 0.7811 Epoch 50/50 20031/20031 - 4s - loss: 0.0229 - accuracy: 0.9897 - val_loss: 2.5152 - val_accuracy: 0.7928 2019-09-22 16:33:41,089 graeae.timers.timer end: Ended: 2019-09-22 16:33:41.089405 I0922 16:33:41.089459 139873020925760 timer.py:77] Ended: 2019-09-22 16:33:41.089405 2019-09-22 16:33:41,091 graeae.timers.timer end: Elapsed: 0:03:20.719519 I0922 16:33:41.091247 139873020925760 timer.py:78] Elapsed: 0:03:20.719519
Once again it looks like the model is overfitting; I should add a checkpoint or something.
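One way to handle that, which I haven't used in the run above, would be to pass EarlyStopping and ModelCheckpoint callbacks to model.fit. A minimal sketch, with an arbitrary patience value and checkpoint path:

# Sketch of callbacks to curb the overfitting (not used in the run above;
# the patience value and checkpoint path are arbitrary choices).
callbacks = [
    tensorflow.keras.callbacks.EarlyStopping(monitor="val_loss",
                                             patience=3,
                                             restore_best_weights=True),
    tensorflow.keras.callbacks.ModelCheckpoint("sarcasm_cnn.hdf5",
                                               monitor="val_loss",
                                               save_best_only=True),
]
# history = model.fit(training_padded, y_train.values,
#                     epochs=Training.epochs,
#                     validation_data=(testing_padded, y_test.values),
#                     verbose=Training.verbosity,
#                     callbacks=callbacks)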
Plot the Performance
performance = pandas.DataFrame(history.history)
plot = performance.hvplot().opts(title="CNN Sarcasm Training Performance",
                                 width=1000,
                                 height=800)
Embed(plot=plot, file_name="cnn_training")()
The validation loss just keeps climbing while the training loss drops, so there's something very wrong with the validation performance. I'll have to look into that.