NLP Classification Exercise
Beginning
Imports
Python
from argparse import Namespace
from functools import partial
from pathlib import Path
PyPI
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import hvplot.pandas
import numpy
import pandas
import tensorflow
Others
from graeae import (CountPercentage,
EmbedHoloviews,
SubPathLoader,
Timer,
ZipDownloader)
Set Up
The Timer
TIMER = Timer()
The Plotting
slug = "nlp-classification-exercise"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{slug}")
The Dataset
It isn't mentioned in the notebook where the data originally came from, but it looks like it's the Sentiment140 dataset, which consists of tweets whose sentiment was inferred from the emoticons in each tweet.
url = "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"
path = Path("~/data/datasets/texts/sentiment140/").expanduser()
download = ZipDownloader(url, path)
download()
Files exist, not downloading
columns = ["polarity", "tweet_id", "datetime", "query", "user", "text"]
training = pandas.read_csv(path/"training.1600000.processed.noemoticon.csv",
encoding="latin-1", names=columns, header=None)
testing = pandas.read_csv(path/"testdata.manual.2009.06.14.csv",
encoding="latin-1", names=columns, header=None)
Some Constants
Text = Namespace(
embedding_dim = 100,
max_length = 16,
trunc_type='post',
padding_type='post',
oov_tok = "<OOV>",
training_size=16000,
)
Data = Namespace(
batch_size = 64,
shuffle_buffer_size=100,
)
Middle
The Data
print(training.sample().iloc[0])
polarity                                                   4
tweet_id                                          1468852290
datetime                        Tue Apr 07 04:04:10 PDT 2009
query                                               NO_QUERY
user                                             leawoodward
text       Def off now...unexpected day out tomorrow so s...
Name: 806643, dtype: object
CountPercentage(training.polarity)()
Value | Count | Percent (%) |
---|---|---|
4 | 800,000 | 50.00 |
0 | 800,000 | 50.00 |
The polarity is what might also be called the "sentiment" of the tweet - 0 means a negative tweet and 4 means a positive tweet. But for our purposes we would be better off if the positive polarity was 1, not 4, so let's convert it.
training.loc[training.polarity==4, "polarity"] = 1
counts = CountPercentage(training.polarity)()
Value | Count | Percent (%) |
---|---|---|
1 | 800,000 | 50.00 |
0 | 800,000 | 50.00 |
The Tokenizer
As you can see from the sample, the data is still in text form so we need to convert it to a numeric form with a Tokenizer.
First I'll lower-case the text.
training.loc[:, "text"] = training.text.str.lower()
Next we'll fit it to our text.
tokenizer = Tokenizer()
with TIMER:
    tokenizer.fit_on_texts(training.text.values)
2019-10-10 07:25:09,065 graeae.timers.timer start: Started: 2019-10-10 07:25:09.065039
2019-10-10 07:25:45,389 graeae.timers.timer end: Ended: 2019-10-10 07:25:45.389540
2019-10-10 07:25:45,391 graeae.timers.timer end: Elapsed: 0:00:36.324501
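Note that Text.oov_tok was defined in the constants but never passed to the Tokenizer, so words that weren't seen during fitting will simply be dropped from any later sequences. If you wanted them mapped to a placeholder instead, you could build the tokenizer like this (a hypothetical alternative, not what was run above):

tokenizer = Tokenizer(oov_token=Text.oov_tok)
tokenizer.fit_on_texts(training.text.values)

For this exercise we'll stick with the tokenizer as it was fit above.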
Now, we can store some of its values in variables for convenience.
word_index = tokenizer.word_index
vocabulary_size = len(tokenizer.word_index)
Now, we'll convert the texts to sequences and pad them so they are all the same length.
with TIMER:
    sequences = tokenizer.texts_to_sequences(training.text.values)
    padded = pad_sequences(sequences, maxlen=Text.max_length,
                           truncating=Text.trunc_type)
    splits = train_test_split(
        padded, training.polarity, test_size=.2)
    training_sequences, test_sequences, training_labels, test_labels = splits
2019-10-10 07:25:51,057 graeae.timers.timer start: Started: 2019-10-10 07:25:51.057684
2019-10-10 07:26:33,530 graeae.timers.timer end: Ended: 2019-10-10 07:26:33.530338
2019-10-10 07:26:33,531 graeae.timers.timer end: Elapsed: 0:00:42.472654
Now convert them to datasets.
training_dataset = tensorflow.data.Dataset.from_tensor_slices(
(training_sequences, training_labels)
)
testing_dataset = tensorflow.data.Dataset.from_tensor_slices(
(test_sequences, test_labels)
)
training_dataset = training_dataset.shuffle(Data.shuffle_buffer_size).batch(Data.batch_size)
testing_dataset = testing_dataset.shuffle(Data.shuffle_buffer_size).batch(Data.batch_size)
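As a quick sanity check (an added sketch, assuming eager execution so the dataset can be iterated directly), we can pull one batch and confirm that its shape matches Data.batch_size and Text.max_length:

example_sequences, example_labels = next(iter(training_dataset))
print(example_sequences.shape)  # (64, 16) - (Data.batch_size, Text.max_length)
print(example_labels.shape)     # (64,)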
GloVe
GloVe is short for Global Vectors for Word Representation. It is an unsupervised algorithm that creates vector representations for words. They have a site where you can download pre-trained models or get the code and train one yourself. We're going to use one of their pre-trained models.
path = Path("~/models/glove/").expanduser()
url = "http://nlp.stanford.edu/data/glove.6B.zip"
ZipDownloader(url, path)()
Files exist, not downloading
The GloVe data is stored as a series of space-separated lines, with the first column being the word that's encoded and the rest of the columns being the values for its vector. To make this work we're going to split the word off from the vector and store them in a dictionary, with the word as the key and the vector as the value.
embeddings = {}
with TIMER:
    with open(path/"glove.6B.100d.txt") as lines:
        for line in lines:
            tokens = line.split()
            embeddings[tokens[0]] = numpy.array(tokens[1:])
2019-10-06 18:55:11,592 graeae.timers.timer start: Started: 2019-10-06 18:55:11.592880
2019-10-06 18:55:21,542 graeae.timers.timer end: Ended: 2019-10-06 18:55:21.542689
2019-10-06 18:55:21,544 graeae.timers.timer end: Elapsed: 0:00:09.949809
print(f"{len(embeddings):,}")
400,000
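As a quick check (an added aside, assuming a common word like "the" is in the GloVe vocabulary), each vector should have Text.embedding_dim entries:

print(len(embeddings["the"]))  # 100, matching Text.embedding_dim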
So, our vocabulary consists of 400,000 "words" (tokens is more accurate, since they also include punctuation). The problem we have to deal with next is that our data set wasn't part of the dataset used to train the embeddings, so there will probably be some tokens in our data set that aren't in the embeddings. To handle this we need to add zeroed embeddings for the extra tokens.
Rather than adding to the dict, we'll create a matrix of zeros with a row for each word in our dataset's vocabulary, then we'll iterate over the words in our dataset and, if there's a match in the GloVe embeddings, we'll insert it into the matrix.
with TIMER:
    embeddings_matrix = numpy.zeros((vocabulary_size + 1, Text.embedding_dim))
    for word, index in word_index.items():
        embedding_vector = embeddings.get(word)
        if embedding_vector is not None:
            embeddings_matrix[index] = embedding_vector
2019-10-06 18:55:46,577 graeae.timers.timer start: Started: 2019-10-06 18:55:46.577855 I1006 18:55:46.577886 140055379531584 timer.py:70] Started: 2019-10-06 18:55:46.577855 2019-10-06 18:55:51,374 graeae.timers.timer end: Ended: 2019-10-06 18:55:51.374706 I1006 18:55:51.374763 140055379531584 timer.py:77] Ended: 2019-10-06 18:55:51.374706 2019-10-06 18:55:51,377 graeae.timers.timer end: Elapsed: 0:00:04.796851 I1006 18:55:51.377207 140055379531584 timer.py:78] Elapsed: 0:00:04.796851
print(f"{len(embeddings_matrix):,}")
690,961
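It's also worth knowing how much of our tweet vocabulary actually matched a GloVe entry, since anything that didn't match stays a row of zeros. Here's a small check (an added sketch using the word_index and embeddings dictionaries built above):

found = sum(1 for word in word_index if word in embeddings)
print(f"{found:,} of {vocabulary_size:,} tokens have GloVe vectors "
      f"({100 * found/vocabulary_size:.2f}%)")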
The Models
A CNN
- Build
convoluted_model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(
        vocabulary_size + 1,
        Text.embedding_dim,
        input_length=Text.max_length,
        weights=[embeddings_matrix],
        trainable=False),
    tensorflow.keras.layers.Conv1D(filters=128, kernel_size=5, activation='relu'),
    tensorflow.keras.layers.GlobalMaxPooling1D(),
    tensorflow.keras.layers.Dense(24, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])

convoluted_model.compile(loss="binary_crossentropy",
                         optimizer="rmsprop",
                         metrics=["accuracy"])
print(convoluted_model.summary())
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 16, 100) 69096100 _________________________________________________________________ conv1d (Conv1D) (None, 12, 128) 64128 _________________________________________________________________ global_max_pooling1d (Global (None, 128) 0 _________________________________________________________________ dense (Dense) (None, 24) 3096 _________________________________________________________________ dense_1 (Dense) (None, 1) 25 ================================================================= Total params: 69,163,349 Trainable params: 67,249 Non-trainable params: 69,096,100 _________________________________________________________________ None
- Train
Training = Namespace(
    size = 0.75,
    epochs = 2,
    verbosity = 2,
    batch_size=128,
)
with TIMER:
    cnn_history = convoluted_model.fit(training_dataset,
                                       epochs=Training.epochs,
                                       validation_data=testing_dataset,
                                       verbose=Training.verbosity)
2019-10-10 07:27:04,921 graeae.timers.timer start: Started: 2019-10-10 07:27:04.921617
Epoch 1/2
20000/20000 - 4964s - loss: 0.5091 - accuracy: 0.7454 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/2
20000/20000 - 4935s - loss: 0.4790 - accuracy: 0.7671 - val_loss: 0.4782 - val_accuracy: 0.7677
2019-10-10 10:12:04,382 graeae.timers.timer end: Ended: 2019-10-10 10:12:04.382359
2019-10-10 10:12:04,384 graeae.timers.timer end: Elapsed: 2:44:59.460742
- Some Plotting
performance = pandas.DataFrame(cnn_history.history)
plot = performance.hvplot().opts(
    title="CNN Twitter Sentiment Training Performance",
    width=1000,
    height=800)
Embed(plot=plot, file_name="cnn_training")()
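The notebook doesn't show the model being used on new text, but prediction is just a matter of repeating the preprocessing - lower-casing, converting to sequences, and padding - before calling predict. Here's a minimal sketch (assuming the tokenizer and convoluted_model from above); the sigmoid output is the probability that a tweet is positive:

def predict_sentiment(tweets):
    """Return the probability that each tweet has positive sentiment."""
    sequences = tokenizer.texts_to_sequences([tweet.lower() for tweet in tweets])
    padded = pad_sequences(sequences, maxlen=Text.max_length,
                           truncating=Text.trunc_type)
    return convoluted_model.predict(padded).squeeze()

print(predict_sentiment(["this made my day", "worst flight ever"]))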
End
Citations
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.