He Used Sarcasm
Beginning
This is a look at fitting a model to detect sarcasm in headlines, using a JSON dataset provided by Laurence Moroney (https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json).
Imports
Python
from argparse import Namespace
from functools import partial
from pathlib import Path
from pprint import pprint
from urllib.parse import urlparse
import json
PyPi
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import hvplot.pandas
import pandas
import tensorflow
Other
from graeae import (
CountPercentage,
EmbedHoloviews,
TextDownloader,
Timer
)
Set Up
The Timer
TIMER = Timer()
The Plotting
SLUG = "he-used-sarcasm"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")
The Data
URL = "https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json"
path = Path("~/data/datasets/text/sarcasm/sarcasm.json").expanduser()
downloader = TextDownloader(URL, path)
with TIMER:
    data = json.loads(downloader.download)
2019-09-22 15:05:27,302 graeae.timers.timer start: Started: 2019-09-22 15:05:27.301225
2019-09-22 15:05:27,306 TextDownloader download: /home/hades/data/datasets/text/sarcasm/sarcasm.json exists, opening it
2019-09-22 15:05:27,367 graeae.timers.timer end: Ended: 2019-09-22 15:05:27.367036
2019-09-22 15:05:27,369 graeae.timers.timer end: Elapsed: 0:00:00.065811
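The TextDownloader, Timer, and CountPercentage classes come from my graeae helper package. If you don't have it, a minimal stand-in for the download step using just the standard library might look something like this (download_json is a made-up helper, purely for illustration):

# A minimal stand-in for TextDownloader (an illustrative sketch, not the graeae implementation).
from urllib.request import urlopen

def download_json(url: str, path: Path) -> list:
    """Download the sarcasm JSON to `path` if it isn't cached, then load it."""
    if not path.is_file():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(urlopen(url).read().decode("utf-8"))
    return json.loads(path.read_text())

# data = download_json(URL, path)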
Middle
Looking At the Data
pprint(data[0])
{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
 'headline': "former versace store clerk sues over secret 'black code' for "
             'minority shoppers',
 'is_sarcastic': 0}
So our data is a list of dictionaries, each with three keys: the source of the article, the headline of the article, and whether or not the headline is sarcastic. There's no citation in the original notebook, but it looks like it might be this one on GitHub.
data = pandas.DataFrame(data)
data.loc[:, "site"] = data.article_link.apply(lambda link: urlparse(link).netloc)
CountPercentage(data.site, value_label="Site")()
Site | Count | Percent (%) |
---|---|---|
www.huffingtonpost.com | 14403 | 53.93 |
www.theonion.com | 5811 | 21.76 |
local.theonion.com | 2852 | 10.68 |
politics.theonion.com | 1767 | 6.62 |
entertainment.theonion.com | 1194 | 4.47 |
www.huffingtonpost.comhttp: | 503 | 1.88 |
sports.theonion.com | 100 | 0.37 |
www.huffingtonpost.comhttps: | 79 | 0.30 |
So, it looks like there are some problems with the URLs. I don't think that's important, but maybe I can clean them up a little anyway.
print(
    data[
        data.site.str.contains("www.huffingtonpost.comhttp")
    ].article_link.value_counts())
https://www.huffingtonpost.comhttp://nymag.com/daily/intelligencer/2016/05/hillary-clinton-candidacy.html    2
https://www.huffingtonpost.comhttps://www.facebook.com/HuffPostQueerVoices/videos/1153919084666530/    1
https://www.huffingtonpost.comhttp://live.huffingtonpost.com/r/segment/chris-meloni-star-underground/56d8584f99ec6dca3d00000a    1
https://www.huffingtonpost.comhttp://www.thestreet.com/story/13223501/1/post-retirement-work-may-not-save-your-golden-years.html    1
https://www.huffingtonpost.comhttps://www.facebook.com/HuffPostEntertainment/    1
..
https://www.huffingtonpost.comhttp://nymag.com/thecut/2015/10/first-legal-abortionists-tell-their-stories.html?mid=twitter_nymag    1
https://www.huffingtonpost.comhttp://www.tampabay.com/blogs/the-buzz-florida-politics/marco-rubio-warming-up-to-donald-trump/2275308    1
https://www.huffingtonpost.comhttp://live.huffingtonpost.com/r/segment/porn-to-pay-for-college/55aeadf62b8c2a2f6f000193    1
https://www.huffingtonpost.comhttps://www.thedodo.com/dog-mouth-taped-shut-facebook-1481874724.html    1
https://www.huffingtonpost.comhttps://www.washingtonpost.com/politics/ben-carson-to-tell-supporters-he-sees-no-path-forward-for-campaign/2016/03/02/d6bef352-d9b3-11e5-891a-4ed04f4213e8_story.html    1
Name: article_link, Length: 581, dtype: int64
That's kind of odd. I don't know what it means; maybe the Huffington Post was citing other sites? I went to check the GitHub dataset I mentioned, but it's actually much larger than this one, so I don't know whether it's really the source or not.
prefixes = ("www.huffingtonpost.comhttp:", "www.huffingtonpost.comhttps:")
for prefix in prefixes:
data.loc[:, "site"] = data.site.str.replace(
prefix,
"www.huffingtonpost.com")
prefixes = ("local.theonion.com",
"politics.theonion.com",
"entertainment.theonion.com",
"sports.theonion.com")
for prefix in prefixes:
data.loc[:, "site"] = data.site.str.replace(prefix,
"www.theonion.com")
counter = CountPercentage(data.site, value_label="Site")
counter()
Site | Count | Percent (%) |
---|---|---|
www.huffingtonpost.com | 14985 | 56.10 |
www.theonion.com | 11724 | 43.90 |
plot = counter.table.hvplot.bar(x="Site", y="Count").opts(
    title="Distribution by Site",
    width=1000,
    height=800)
Embed(plot=plot, file_name="site_distribution")()
counter = CountPercentage(data.is_sarcastic, value_label="Is Sarcastic")
counter()
Is Sarcastic | Count | Percent (%) |
---|---|---|
0 | 14,985 | 56.10 |
1 | 11,724 | 43.90 |
Given that the counts match, I'm assuming anything from the Huffington Post is labeled as not sarcastic and anything from The Onion is sarcastic.
assert all(data[data.site=="www.theonion.com"].is_sarcastic)
assert not any(data[data.site=="www.huffingtonpost.com"].is_sarcastic)
Set Up the Tokenizing and Training Data
print(f"{len(data):,}")
26,709
Text = Namespace(
    vocabulary_size = 1000,
    embedding_dim = 16,
    max_length = 120,
    truncating_type='post',
    padding_type='post',
    out_of_vocabulary_tok = "<OOV>",
)

# the size is actually the default for train_test_split
Training = Namespace(
    size = 0.75,
    epochs = 50,
    verbosity = 2,
)
The Training and Testing Data
x_train, x_test, y_train, y_test = train_test_split(
    data.headline, data.is_sarcastic, train_size=Training.size,
)
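A quick sanity check on the split sizes (with the 0.75 train fraction this works out to 20,031 training and 6,678 testing headlines):

# The two pieces should add back up to the full data set.
print(f"Training: {len(x_train):,}\tTesting: {len(x_test):,}")
assert len(x_train) + len(x_test) == len(data)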
The Tokenizer
tokenizer = Tokenizer(num_words=Text.vocabulary_size,
                      oov_token=Text.out_of_vocabulary_tok)
print(tokenizer.__doc__)
Text tokenization utility class. This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf... # Arguments num_words: the maximum number of words to keep, based on word frequency. Only the most common `num_words-1` words will be kept. filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the `'` character. lower: boolean. Whether to convert the texts to lowercase. split: str. Separator for word splitting. char_level: if True, every character will be treated as a token. oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls By default, all punctuation is removed, turning the texts into space-separated sequences of words (words maybe include the `'` character). These sequences are then split into lists of tokens. They will then be indexed or vectorized. `0` is a reserved index that won't be assigned to any word.
Now that we have a tokenizer we can tokenize our training headlines.
help(tokenizer.fit_on_texts)
Help on method fit_on_texts in module keras_preprocessing.text: fit_on_texts(texts) method of keras_preprocessing.text.Tokenizer instance Updates internal vocabulary based on a list of texts. In the case where texts contains lists, we assume each entry of the lists to be a token. Required before using `texts_to_sequences` or `texts_to_matrix`. # Arguments texts: can be a list of strings, a generator of strings (for memory-efficiency), or a list of list of strings.
tokenizer.fit_on_texts(x_train)
Now that we've fit the tokenizer on the headlines, we can get the word index, a dict mapping each word to its index.
word_index = tokenizer.word_index
Note that the tokenizer doesn't remove stop-words.
print("the" in word_index)
True
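Just to get a feel for the mapping, here's a peek at the lowest indices (the exact entries depend on the train/test split, so this is only illustrative):

# The lowest indices go to the OOV token and the most frequent words.
print(sorted(word_index.items(), key=lambda item: item[1])[:10])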
Now we'll convert the training headlines to sequences of numbers.
help(tokenizer.texts_to_sequences)
Help on method texts_to_sequences in module keras_preprocessing.text: texts_to_sequences(texts) method of keras_preprocessing.text.Tokenizer instance Transforms each text in texts to a sequence of integers. Only top `num_words-1` most frequent words will be taken into account. Only words known by the tokenizer will be taken into account. # Arguments texts: A list of texts (strings). # Returns A list of sequences.
We're also going to have to pad them to make them the same length.
help(pad_sequences)
Help on function pad_sequences in module keras_preprocessing.sequence: pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0) Pads sequences to the same length. This function transforms a list of `num_samples` sequences (lists of integers) into a 2D Numpy array of shape `(num_samples, num_timesteps)`. `num_timesteps` is either the `maxlen` argument if provided, or the length of the longest sequence otherwise. Sequences that are shorter than `num_timesteps` are padded with `value` at the end. Sequences longer than `num_timesteps` are truncated so that they fit the desired length. The position where padding or truncation happens is determined by the arguments `padding` and `truncating`, respectively. Pre-padding is the default. # Arguments sequences: List of lists, where each element is a sequence. maxlen: Int, maximum length of all sequences. dtype: Type of the output sequences. To pad sequences with variable length strings, you can use `object`. padding: String, 'pre' or 'post': pad either before or after each sequence. truncating: String, 'pre' or 'post': remove values from sequences larger than `maxlen`, either at the beginning or at the end of the sequences. value: Float or String, padding value. # Returns x: Numpy array with shape `(len(sequences), maxlen)` # Raises ValueError: In case of invalid values for `truncating` or `padding`, or in case of invalid shape for a `sequences` entry.
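To make the padding and truncating behavior concrete, here's a toy example with made-up sequences (not from our data), using the same post settings we'll use below:

# One sequence shorter and one longer than maxlen=5.
example = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]]
print(pad_sequences(example, maxlen=5, padding="post", truncating="post"))

The short sequence gets zeros appended and the long one loses its last two entries.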
training_sequences = tokenizer.texts_to_sequences(x_train)
training_padded = pad_sequences(training_sequences,
                                maxlen=Text.max_length,
                                padding=Text.padding_type,
                                truncating=Text.truncating_type)

testing_sequences = tokenizer.texts_to_sequences(x_test)
testing_padded = pad_sequences(testing_sequences,
                               maxlen=Text.max_length,
                               padding=Text.padding_type,
                               truncating=Text.truncating_type)
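The padded arrays should now all share the same second dimension, Text.max_length:

# Both shapes should be (number of headlines, 120).
print(training_padded.shape)
print(testing_padded.shape)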
Build The Model
We're going to use a convolutional neural network to try to classify our headlines as sarcastic or not-sarcastic.
It's a Sequence of Layers
print(tensorflow.keras.Sequential.__doc__)
Linear stack of layers. Arguments: layers: list of layers to add to the model. Example: ```python # Optionally, the first layer can receive an `input_shape` argument: model = Sequential() model.add(Dense(32, input_shape=(500,))) # Afterwards, we do automatic shape inference: model.add(Dense(32)) # This is identical to the following: model = Sequential() model.add(Dense(32, input_dim=500)) # And to the following: model = Sequential() model.add(Dense(32, batch_input_shape=(None, 500))) # Note that you can also omit the `input_shape` argument: # In that case the model gets built the first time you call `fit` (or other # training and evaluation methods). model = Sequential() model.add(Dense(32)) model.add(Dense(32)) model.compile(optimizer=optimizer, loss=loss) # This builds the model for the first time: model.fit(x, y, batch_size=32, epochs=10) # Note that when using this delayed-build pattern (no input shape specified), # the model doesn't have any weights until the first call # to a training/evaluation method (since it isn't yet built): model = Sequential() model.add(Dense(32)) model.add(Dense(32)) model.weights # returns [] # Whereas if you specify the input shape, the model gets built continuously # as you are adding layers: model = Sequential() model.add(Dense(32, input_shape=(500,))) model.add(Dense(32)) model.weights # returns list of length 4 # When using the delayed-build pattern (no input shape specified), you can # choose to manually build your model by calling `build(batch_input_shape)`: model = Sequential() model.add(Dense(32)) model.add(Dense(32)) model.build((None, 500)) model.weights # returns list of length 4 ```
Start With An Embedding Layer
print(tensorflow.keras.layers.Embedding.__doc__)
Turns positive integers (indexes) into dense vectors of fixed size. e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]` This layer can only be used as the first layer in a model. Example: ```python model = Sequential() model.add(Embedding(1000, 64, input_length=10)) # the model will take as input an integer matrix of size (batch, # input_length). # the largest integer (i.e. word index) in the input should be no larger # than 999 (vocabulary size). # now model.output_shape == (None, 10, 64), where None is the batch # dimension. input_array = np.random.randint(1000, size=(32, 10)) model.compile('rmsprop', 'mse') output_array = model.predict(input_array) assert output_array.shape == (32, 10, 64) ``` Arguments: input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1. output_dim: int >= 0. Dimension of the dense embedding. embeddings_initializer: Initializer for the `embeddings` matrix. embeddings_regularizer: Regularizer function applied to the `embeddings` matrix. embeddings_constraint: Constraint function applied to the `embeddings` matrix. mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is `True` then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1). input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect `Flatten` then `Dense` layers upstream (without it, the shape of the dense outputs cannot be computed). Input shape: 2D tensor with shape: `(batch_size, input_length)`. Output shape: 3D tensor with shape: `(batch_size, input_length, output_dim)`.
The Convolutional Layer
print(tensorflow.keras.layers.Conv1D.__doc__)
1D convolution layer (e.g. temporal convolution). This layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a tensor of outputs. If `use_bias` is True, a bias vector is created and added to the outputs. Finally, if `activation` is not `None`, it is applied to the outputs as well. When using this layer as the first layer in a model, provide an `input_shape` argument (tuple of integers or `None`, e.g. `(10, 128)` for sequences of 10 vectors of 128-dimensional vectors, or `(None, 128)` for variable-length sequences of 128-dimensional vectors. Arguments: filters: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution). kernel_size: An integer or tuple/list of a single integer, specifying the length of the 1D convolution window. strides: An integer or tuple/list of a single integer, specifying the stride length of the convolution. Specifying any stride value != 1 is incompatible with specifying any `dilation_rate` value != 1. padding: One of `"valid"`, `"causal"` or `"same"` (case-insensitive). `"causal"` results in causal (dilated) convolutions, e.g. output[t] does not depend on input[t+1:]. Useful when modeling temporal data where the model should not violate the temporal order. See [WaveNet: A Generative Model for Raw Audio, section 2.1](https://arxiv.org/abs/1609.03499). data_format: A string, one of `channels_last` (default) or `channels_first`. dilation_rate: an integer or tuple/list of a single integer, specifying the dilation rate to use for dilated convolution. Currently, specifying any `dilation_rate` value != 1 is incompatible with specifying any `strides` value != 1. activation: Activation function to use. If you don't specify anything, no activation is applied (ie. "linear" activation: `a(x) = x`). use_bias: Boolean, whether the layer uses a bias vector. kernel_initializer: Initializer for the `kernel` weights matrix. bias_initializer: Initializer for the bias vector. kernel_regularizer: Regularizer function applied to the `kernel` weights matrix. bias_regularizer: Regularizer function applied to the bias vector. activity_regularizer: Regularizer function applied to the output of the layer (its "activation").. kernel_constraint: Constraint function applied to the kernel matrix. bias_constraint: Constraint function applied to the bias vector. Examples: ```python # Small convolutional model for 128-length vectors with 6 timesteps # model.input_shape == (None, 6, 128) model = Sequential() model.add(Conv1D(32, 3, activation='relu', input_shape=(6, 128))) # now: model.output_shape == (None, 4, 32) ``` Input shape: 3D tensor with shape: `(batch_size, steps, input_dim)` Output shape: 3D tensor with shape: `(batch_size, new_steps, filters)` `steps` value might have changed due to padding or strides.
A Pooling Layer
print(tensorflow.keras.layers.GlobalMaxPooling1D.__doc__)
Global max pooling operation for temporal data. Arguments: data_format: A string, one of `channels_last` (default) or `channels_first`. The ordering of the dimensions in the inputs. `channels_last` corresponds to inputs with shape `(batch, steps, features)` while `channels_first` corresponds to inputs with shape `(batch, features, steps)`. Input shape: - If `data_format='channels_last'`: 3D tensor with shape: `(batch_size, steps, features)` - If `data_format='channels_first'`: 3D tensor with shape: `(batch_size, features, steps)` Output shape: 2D tensor with shape `(batch_size, features)`.
The Fully-Connected Layers
Finally our output layers.
print(tensorflow.keras.layers.Dense.__doc__)
Just your regular densely-connected NN layer. `Dense` implements the operation: `output = activation(dot(input, kernel) + bias)` where `activation` is the element-wise activation function passed as the `activation` argument, `kernel` is a weights matrix created by the layer, and `bias` is a bias vector created by the layer (only applicable if `use_bias` is `True`). Note: If the input to the layer has a rank greater than 2, then it is flattened prior to the initial dot product with `kernel`. Example: ```python # as first layer in a sequential model: model = Sequential() model.add(Dense(32, input_shape=(16,))) # now the model will take as input arrays of shape (*, 16) # and output arrays of shape (*, 32) # after the first layer, you don't need to specify # the size of the input anymore: model.add(Dense(32)) ``` Arguments: units: Positive integer, dimensionality of the output space. activation: Activation function to use. If you don't specify anything, no activation is applied (ie. "linear" activation: `a(x) = x`). use_bias: Boolean, whether the layer uses a bias vector. kernel_initializer: Initializer for the `kernel` weights matrix. bias_initializer: Initializer for the bias vector. kernel_regularizer: Regularizer function applied to the `kernel` weights matrix. bias_regularizer: Regularizer function applied to the bias vector. activity_regularizer: Regularizer function applied to the output of the layer (its "activation").. kernel_constraint: Constraint function applied to the `kernel` weights matrix. bias_constraint: Constraint function applied to the bias vector. Input shape: N-D tensor with shape: `(batch_size, ..., input_dim)`. The most common situation would be a 2D input with shape `(batch_size, input_dim)`. Output shape: N-D tensor with shape: `(batch_size, ..., units)`. For instance, for a 2D input with shape `(batch_size, input_dim)`, the output would have shape `(batch_size, units)`.
Build It
I originally added the layers using the model.add method, but when I tried to train it the output said the layers didn't have gradients and it never improved… I'll have to look into that, but in the meantime, passing all the layers to the Sequential constructor seems to work.
model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(
        input_dim=Text.vocabulary_size,
        output_dim=Text.embedding_dim,
        input_length=Text.max_length),
    tensorflow.keras.layers.Conv1D(filters=128,
                                   kernel_size=5,
                                   activation='relu'),
    tensorflow.keras.layers.GlobalMaxPooling1D(),
    tensorflow.keras.layers.Dense(24, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
Compile It
print(model.compile.__doc__)
Configures the model for training. Arguments: optimizer: String (name of optimizer) or optimizer instance. See `tf.keras.optimizers`. loss: String (name of objective function), objective function or `tf.losses.Loss` instance. See `tf.losses`. If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses. metrics: List of metrics to be evaluated by the model during training and testing. Typically you will use `metrics=['accuracy']`. To specify different metrics for different outputs of a multi-output model, you could also pass a dictionary, such as `metrics={'output_a': 'accuracy', 'output_b': ['accuracy', 'mse']}`. You can also pass a list (len = len(outputs)) of lists of metrics such as `metrics=[['accuracy'], ['accuracy', 'mse']]` or `metrics=['accuracy', ['accuracy', 'mse']]`. loss_weights: Optional list or dictionary specifying scalar coefficients (Python floats) to weight the loss contributions of different model outputs. The loss value that will be minimized by the model will then be the *weighted sum* of all individual losses, weighted by the `loss_weights` coefficients. If a list, it is expected to have a 1:1 mapping to the model's outputs. If a tensor, it is expected to map output names (strings) to scalar coefficients. sample_weight_mode: If you need to do timestep-wise sample weighting (2D weights), set this to `"temporal"`. `None` defaults to sample-wise weights (1D). If the model has multiple outputs, you can use a different `sample_weight_mode` on each output by passing a dictionary or a list of modes. weighted_metrics: List of metrics to be evaluated and weighted by sample_weight or class_weight during training and testing. target_tensors: By default, Keras will create placeholders for the model's target, which will be fed with the target data during training. If instead you would like to use your own target tensors (in turn, Keras will not expect external Numpy data for these targets at training time), you can specify them via the `target_tensors` argument. It can be a single tensor (for a single-output model), a list of tensors, or a dict mapping output names to target tensors. distribute: NOT SUPPORTED IN TF 2.0, please create and compile the model under distribution strategy scope instead of passing it to compile. **kwargs: Any additional arguments. Raises: ValueError: In case of invalid arguments for `optimizer`, `loss`, `metrics` or `sample_weight_mode`.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_2 (Embedding) (None, 120, 16) 16000 _________________________________________________________________ conv1d_1 (Conv1D) (None, 116, 128) 10368 _________________________________________________________________ global_max_pooling1d_2 (Glob (None, 128) 0 _________________________________________________________________ dense_4 (Dense) (None, 24) 3096 _________________________________________________________________ dense_5 (Dense) (None, 1) 25 ================================================================= Total params: 29,489 Trainable params: 29,489 Non-trainable params: 0 _________________________________________________________________ None
Train It
print(model.fit.__doc__)
Trains the model for a fixed number of epochs (iterations on a dataset). Arguments: x: Input data. It could be: - A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs). - A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs). - A dict mapping input names to the corresponding array/tensors, if the model has named inputs. - A `tf.data` dataset. Should return a tuple of either `(inputs, targets)` or `(inputs, targets, sample_weights)`. - A generator or `keras.utils.Sequence` returning `(inputs, targets)` or `(inputs, targets, sample weights)`. y: Target data. Like the input data `x`, it could be either Numpy array(s) or TensorFlow tensor(s). It should be consistent with `x` (you cannot have Numpy inputs and tensor targets, or inversely). If `x` is a dataset, generator, or `keras.utils.Sequence` instance, `y` should not be specified (since targets will be obtained from `x`). batch_size: Integer or `None`. Number of samples per gradient update. If unspecified, `batch_size` will default to 32. Do not specify the `batch_size` if your data is in the form of symbolic tensors, datasets, generators, or `keras.utils.Sequence` instances (since they generate batches). epochs: Integer. Number of epochs to train the model. An epoch is an iteration over the entire `x` and `y` data provided. Note that in conjunction with `initial_epoch`, `epochs` is to be understood as "final epoch". The model is not trained for a number of iterations given by `epochs`, but merely until the epoch of index `epochs` is reached. verbose: 0, 1, or 2. Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. Note that the progress bar is not particularly useful when logged to a file, so verbose=2 is recommended when not running interactively (eg, in a production environment). callbacks: List of `keras.callbacks.Callback` instances. List of callbacks to apply during training. See `tf.keras.callbacks`. validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the `x` and `y` data provided, before shuffling. This argument is not supported when `x` is a dataset, generator or `keras.utils.Sequence` instance. validation_data: Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. `validation_data` will override `validation_split`. `validation_data` could be: - tuple `(x_val, y_val)` of Numpy arrays or tensors - tuple `(x_val, y_val, val_sample_weights)` of Numpy arrays - dataset For the first two cases, `batch_size` must be provided. For the last case, `validation_steps` must be provided. shuffle: Boolean (whether to shuffle the training data before each epoch) or str (for 'batch'). 'batch' is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when `steps_per_epoch` is not `None`. class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class. 
sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape `(samples, sequence_length)`, to apply a different weight to every timestep of every sample. In this case you should make sure to specify `sample_weight_mode="temporal"` in `compile()`. This argument is not supported when `x` is a dataset, generator, or `keras.utils.Sequence` instance, instead provide the sample_weights as the third element of `x`. initial_epoch: Integer. Epoch at which to start training (useful for resuming a previous training run). steps_per_epoch: Integer or `None`. Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default `None` is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a `tf.data` dataset, and 'steps_per_epoch' is None, the epoch will run until the input dataset is exhausted. This argument is not supported with array inputs. validation_steps: Only relevant if `validation_data` is provided and is a `tf.data` dataset. Total number of steps (batches of samples) to draw before stopping when performing validation at the end of every epoch. If validation_data is a `tf.data` dataset and 'validation_steps' is None, validation will run until the `validation_data` dataset is exhausted. validation_freq: Only relevant if validation data is provided. Integer or `collections_abc.Container` instance (e.g. list, tuple, etc.). If an integer, specifies how many training epochs to run before a new validation run is performed, e.g. `validation_freq=2` runs validation every 2 epochs. If a Container, specifies the epochs on which to run validation, e.g. `validation_freq=[1, 2, 10]` runs validation at the end of the 1st, 2nd, and 10th epochs. max_queue_size: Integer. Used for generator or `keras.utils.Sequence` input only. Maximum size for the generator queue. If unspecified, `max_queue_size` will default to 10. workers: Integer. Used for generator or `keras.utils.Sequence` input only. Maximum number of processes to spin up when using process-based threading. If unspecified, `workers` will default to 1. If 0, will execute the generator on the main thread. use_multiprocessing: Boolean. Used for generator or `keras.utils.Sequence` input only. If `True`, use process-based threading. If unspecified, `use_multiprocessing` will default to `False`. Note that because this implementation relies on multiprocessing, you should not pass non-picklable arguments to the generator as they can't be passed easily to children processes. **kwargs: Used for backwards compatibility. Returns: A `History` object. Its `History.history` attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable). Raises: RuntimeError: If the model was never compiled. ValueError: In case of mismatch between the provided input data and what the model expects.
with TIMER:
    history = model.fit(training_padded,
                        y_train.values,
                        epochs=Training.epochs,
                        validation_data=(testing_padded, y_test.values),
                        verbose=Training.verbosity)
2019-09-22 16:30:20,369 graeae.timers.timer start: Started: 2019-09-22 16:30:20.369886 I0922 16:30:20.369913 139873020925760 timer.py:70] Started: 2019-09-22 16:30:20.369886 Train on 20031 samples, validate on 6678 samples Epoch 1/50 20031/20031 - 5s - loss: 0.4741 - accuracy: 0.7623 - val_loss: 0.4033 - val_accuracy: 0.8146 Epoch 2/50 20031/20031 - 4s - loss: 0.3663 - accuracy: 0.8366 - val_loss: 0.3980 - val_accuracy: 0.8196 Epoch 3/50 20031/20031 - 4s - loss: 0.3306 - accuracy: 0.8554 - val_loss: 0.3909 - val_accuracy: 0.8240 Epoch 4/50 20031/20031 - 4s - loss: 0.2990 - accuracy: 0.8721 - val_loss: 0.4148 - val_accuracy: 0.8179 Epoch 5/50 20031/20031 - 4s - loss: 0.2697 - accuracy: 0.8867 - val_loss: 0.4050 - val_accuracy: 0.8282 Epoch 6/50 20031/20031 - 4s - loss: 0.2406 - accuracy: 0.9003 - val_loss: 0.4291 - val_accuracy: 0.8212 Epoch 7/50 20031/20031 - 4s - loss: 0.2080 - accuracy: 0.9165 - val_loss: 0.4650 - val_accuracy: 0.8181 Epoch 8/50 20031/20031 - 4s - loss: 0.1824 - accuracy: 0.9272 - val_loss: 0.5053 - val_accuracy: 0.8130 Epoch 9/50 20031/20031 - 4s - loss: 0.1559 - accuracy: 0.9393 - val_loss: 0.5389 - val_accuracy: 0.8065 Epoch 10/50 20031/20031 - 4s - loss: 0.1325 - accuracy: 0.9498 - val_loss: 0.6213 - val_accuracy: 0.8044 Epoch 11/50 20031/20031 - 4s - loss: 0.1104 - accuracy: 0.9599 - val_loss: 0.6902 - val_accuracy: 0.8034 Epoch 12/50 20031/20031 - 4s - loss: 0.0966 - accuracy: 0.9646 - val_loss: 0.7437 - val_accuracy: 0.8035 Epoch 13/50 20031/20031 - 4s - loss: 0.0848 - accuracy: 0.9689 - val_loss: 0.8285 - val_accuracy: 0.7954 Epoch 14/50 20031/20031 - 4s - loss: 0.0693 - accuracy: 0.9753 - val_loss: 0.9121 - val_accuracy: 0.7934 Epoch 15/50 20031/20031 - 4s - loss: 0.0608 - accuracy: 0.9777 - val_loss: 1.0783 - val_accuracy: 0.7931 Epoch 16/50 20031/20031 - 4s - loss: 0.0529 - accuracy: 0.9810 - val_loss: 1.0620 - val_accuracy: 0.7889 Epoch 17/50 20031/20031 - 4s - loss: 0.0506 - accuracy: 0.9819 - val_loss: 1.2497 - val_accuracy: 0.7889 Epoch 18/50 20031/20031 - 4s - loss: 0.0471 - accuracy: 0.9821 - val_loss: 1.2518 - val_accuracy: 0.7963 Epoch 19/50 20031/20031 - 4s - loss: 0.0457 - accuracy: 0.9819 - val_loss: 1.3492 - val_accuracy: 0.7917 Epoch 20/50 20031/20031 - 4s - loss: 0.0392 - accuracy: 0.9851 - val_loss: 1.3702 - val_accuracy: 0.7948 Epoch 21/50 20031/20031 - 4s - loss: 0.0357 - accuracy: 0.9860 - val_loss: 1.4300 - val_accuracy: 0.7948 Epoch 22/50 20031/20031 - 4s - loss: 0.0341 - accuracy: 0.9864 - val_loss: 1.5654 - val_accuracy: 0.7889 Epoch 23/50 20031/20031 - 4s - loss: 0.0360 - accuracy: 0.9860 - val_loss: 1.5615 - val_accuracy: 0.7951 Epoch 24/50 20031/20031 - 4s - loss: 0.0307 - accuracy: 0.9872 - val_loss: 1.6964 - val_accuracy: 0.7953 Epoch 25/50 20031/20031 - 4s - loss: 0.0283 - accuracy: 0.9893 - val_loss: 1.6917 - val_accuracy: 0.7920 Epoch 26/50 20031/20031 - 4s - loss: 0.0365 - accuracy: 0.9850 - val_loss: 1.6935 - val_accuracy: 0.7944 Epoch 27/50 20031/20031 - 4s - loss: 0.0342 - accuracy: 0.9851 - val_loss: 1.7912 - val_accuracy: 0.7853 Epoch 28/50 20031/20031 - 4s - loss: 0.0301 - accuracy: 0.9879 - val_loss: 1.8194 - val_accuracy: 0.7887 Epoch 29/50 20031/20031 - 4s - loss: 0.0254 - accuracy: 0.9887 - val_loss: 1.9231 - val_accuracy: 0.7922 Epoch 30/50 20031/20031 - 4s - loss: 0.0216 - accuracy: 0.9910 - val_loss: 1.9480 - val_accuracy: 0.7914 Epoch 31/50 20031/20031 - 4s - loss: 0.0243 - accuracy: 0.9895 - val_loss: 1.9487 - val_accuracy: 0.7847 Epoch 32/50 20031/20031 - 4s - loss: 0.0241 - accuracy: 0.9891 - val_loss: 2.0333 - 
val_accuracy: 0.7893 Epoch 33/50 20031/20031 - 4s - loss: 0.0334 - accuracy: 0.9863 - val_loss: 1.9498 - val_accuracy: 0.7937 Epoch 34/50 20031/20031 - 4s - loss: 0.0318 - accuracy: 0.9873 - val_loss: 2.0181 - val_accuracy: 0.7942 Epoch 35/50 20031/20031 - 4s - loss: 0.0273 - accuracy: 0.9882 - val_loss: 2.0254 - val_accuracy: 0.7913 Epoch 36/50 20031/20031 - 4s - loss: 0.0236 - accuracy: 0.9897 - val_loss: 2.1159 - val_accuracy: 0.7937 Epoch 37/50 20031/20031 - 4s - loss: 0.0204 - accuracy: 0.9905 - val_loss: 2.1018 - val_accuracy: 0.7950 Epoch 38/50 20031/20031 - 4s - loss: 0.0187 - accuracy: 0.9916 - val_loss: 2.1939 - val_accuracy: 0.7947 Epoch 39/50 20031/20031 - 4s - loss: 0.0253 - accuracy: 0.9888 - val_loss: 2.2090 - val_accuracy: 0.7920 Epoch 40/50 20031/20031 - 4s - loss: 0.0270 - accuracy: 0.9889 - val_loss: 2.2737 - val_accuracy: 0.7862 Epoch 41/50 20031/20031 - 4s - loss: 0.0234 - accuracy: 0.9893 - val_loss: 2.2559 - val_accuracy: 0.7926 Epoch 42/50 20031/20031 - 4s - loss: 0.0223 - accuracy: 0.9902 - val_loss: 2.3223 - val_accuracy: 0.7884 Epoch 43/50 20031/20031 - 4s - loss: 0.0251 - accuracy: 0.9897 - val_loss: 2.2547 - val_accuracy: 0.7863 Epoch 44/50 20031/20031 - 4s - loss: 0.0209 - accuracy: 0.9900 - val_loss: 2.3917 - val_accuracy: 0.7823 Epoch 45/50 20031/20031 - 4s - loss: 0.0245 - accuracy: 0.9889 - val_loss: 2.4222 - val_accuracy: 0.7881 Epoch 46/50 20031/20031 - 4s - loss: 0.0215 - accuracy: 0.9901 - val_loss: 2.4135 - val_accuracy: 0.7869 Epoch 47/50 20031/20031 - 4s - loss: 0.0229 - accuracy: 0.9896 - val_loss: 2.3287 - val_accuracy: 0.7823 Epoch 48/50 20031/20031 - 4s - loss: 0.0191 - accuracy: 0.9918 - val_loss: 2.4639 - val_accuracy: 0.7845 Epoch 49/50 20031/20031 - 4s - loss: 0.0183 - accuracy: 0.9911 - val_loss: 2.6068 - val_accuracy: 0.7811 Epoch 50/50 20031/20031 - 4s - loss: 0.0229 - accuracy: 0.9897 - val_loss: 2.5152 - val_accuracy: 0.7928 2019-09-22 16:33:41,089 graeae.timers.timer end: Ended: 2019-09-22 16:33:41.089405 I0922 16:33:41.089459 139873020925760 timer.py:77] Ended: 2019-09-22 16:33:41.089405 2019-09-22 16:33:41,091 graeae.timers.timer end: Elapsed: 0:03:20.719519 I0922 16:33:41.091247 139873020925760 timer.py:78] Elapsed: 0:03:20.719519
Once again it looks like the model is overfitting; I should add a checkpoint or something.
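One way to handle that, which I haven't used in the run above, would be to pass EarlyStopping and ModelCheckpoint callbacks to model.fit. A minimal sketch, with an arbitrary patience value and checkpoint path:

# Sketch of callbacks to curb the overfitting (not used in the run above;
# the patience value and checkpoint path are arbitrary choices).
callbacks = [
    tensorflow.keras.callbacks.EarlyStopping(monitor="val_loss",
                                             patience=3,
                                             restore_best_weights=True),
    tensorflow.keras.callbacks.ModelCheckpoint("sarcasm_cnn.hdf5",
                                               monitor="val_loss",
                                               save_best_only=True),
]
# history = model.fit(training_padded, y_train.values,
#                     epochs=Training.epochs,
#                     validation_data=(testing_padded, y_test.values),
#                     verbose=Training.verbosity,
#                     callbacks=callbacks)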
Plot the Performance
performance = pandas.DataFrame(history.history)
plot = performance.hvplot().opts(title="CNN Sarcasm Training Performance",
                                 width=1000,
                                 height=800)
Embed(plot=plot, file_name="cnn_training")()
The validation loss just keeps climbing while the training loss drops, so there's something very wrong with the validation performance. I'll have to look into that.