NLP Classification Exercise

Beginning

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

PyPI

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import hvplot.pandas
import numpy
import pandas
import tensorflow

Others

from graeae import (CountPercentage,
                    EmbedHoloviews,
                    SubPathLoader,
                    Timer,
                    ZipDownloader)

Set Up

The Timer

TIMER = Timer()

The Plotting

slug = "nlp-classification-exercise"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{slug}")

The Dataset

It isn't mentioned in the notebook where the data originally came from, but it looks like it's the Sentiment140 dataset, which consists of tweets whose sentiment was inferred from the emoticons in each tweet.

url = "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"
path = Path("~/data/datasets/texts/sentiment140/").expanduser()
download = ZipDownloader(url, path)
download()
Files exist, not downloading
columns = ["polarity", "tweet_id", "datetime", "query", "user", "text"]
training = pandas.read_csv(path/"training.1600000.processed.noemoticon.csv", 
                           encoding="latin-1", names=columns, header=None)
testing = pandas.read_csv(path/"testdata.manual.2009.06.14.csv", 
                           encoding="latin-1", names=columns, header=None)

Some Constants

Text = Namespace(
    embedding_dim = 100,
    max_length = 16,
    trunc_type='post',
    padding_type='post',
    oov_tok = "<OOV>",
    training_size=16000,
)
Data = Namespace(
    batch_size = 64,
    shuffle_buffer_size=100,
)

Middle

The Data

print(training.sample().iloc[0])
polarity                                                    4
tweet_id                                           1468852290
datetime                         Tue Apr 07 04:04:10 PDT 2009
query                                                NO_QUERY
user                                              leawoodward
text        Def off now...unexpected day out tomorrow so s...
Name: 806643, dtype: object
CountPercentage(training.polarity)()
Value    Count  Percent (%)
    4  800,000        50.00
    0  800,000        50.00

The polarity is what might also be called the "sentiment" of the tweet - 0 means a negative tweet and 4 means a positive tweet.

But for our purposes we would be better off if the positive polarity were 1, not 4, so let's convert it.

training.loc[training.polarity==4, "polarity"] = 1
counts = CountPercentage(training.polarity)()
Value    Count  Percent (%)
    1  800,000        50.00
    0  800,000        50.00

The Tokenizer

As you can see from the sample, the data is still in text form, so we need to convert it to a numeric form with a Tokenizer.

First I'll lower-case the text.

training.loc[:, "text"] = training.text.str.lower()

Next we'll create a Tokenizer and fit it on our text.

tokenizer = Tokenizer()
with TIMER:
    tokenizer.fit_on_texts(training.text.values)
2019-10-10 07:25:09,065 graeae.timers.timer start: Started: 2019-10-10 07:25:09.065039
WARNING: Logging before flag parsing goes to stderr.
I1010 07:25:09.065394 140436771002176 timer.py:70] Started: 2019-10-10 07:25:09.065039
2019-10-10 07:25:45,389 graeae.timers.timer end: Ended: 2019-10-10 07:25:45.389540
I1010 07:25:45.389598 140436771002176 timer.py:77] Ended: 2019-10-10 07:25:45.389540
2019-10-10 07:25:45,391 graeae.timers.timer end: Elapsed: 0:00:36.324501
I1010 07:25:45.391984 140436771002176 timer.py:78] Elapsed: 0:00:36.324501

Now, we can store some of its values in variables for convenience.

word_index = tokenizer.word_index
vocabulary_size = len(tokenizer.word_index)
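
To see what the tokenizer produced (a quick sketch, assuming the fitted tokenizer from above), we can peek at a few of the (word, index) pairs and check the size of the vocabulary.

# A few example entries from the word index and the total vocabulary size.
print(list(word_index.items())[:5])
print(f"{vocabulary_size:,} tokens in the tweet vocabulary")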

Now we'll convert the texts to sequences, pad them so they're all the same length, and split them into training and testing sets.

with TIMER:
    sequences = tokenizer.texts_to_sequences(training.text.values)
    padded = pad_sequences(sequences, maxlen=Text.max_length,
                           truncating=Text.trunc_type)

    splits = train_test_split(
        padded, training.polarity, test_size=.2)

    training_sequences, test_sequences, training_labels, test_labels = splits
2019-10-10 07:25:51,057 graeae.timers.timer start: Started: 2019-10-10 07:25:51.057684
I1010 07:25:51.057712 140436771002176 timer.py:70] Started: 2019-10-10 07:25:51.057684
2019-10-10 07:26:33,530 graeae.timers.timer end: Ended: 2019-10-10 07:26:33.530338
I1010 07:26:33.530381 140436771002176 timer.py:77] Ended: 2019-10-10 07:26:33.530338
2019-10-10 07:26:33,531 graeae.timers.timer end: Elapsed: 0:00:42.472654
I1010 07:26:33.531477 140436771002176 timer.py:78] Elapsed: 0:00:42.472654

Now we'll convert them to TensorFlow Datasets and then shuffle and batch them.

training_dataset = tensorflow.data.Dataset.from_tensor_slices(
    (training_sequences, training_labels)
)

testing_dataset = tensorflow.data.Dataset.from_tensor_slices(
    (test_sequences, test_labels)
)

training_dataset = training_dataset.shuffle(Data.shuffle_buffer_size).batch(Data.batch_size)
testing_dataset = testing_dataset.shuffle(Data.shuffle_buffer_size).batch(Data.batch_size)
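
As a quick sanity check (a minimal sketch, assuming the datasets built above), we can pull one batch and confirm that the shapes match Data.batch_size and Text.max_length.

# Grab a single batch: the sequences should be (64, 16) and the labels (64,).
for sequence_batch, label_batch in training_dataset.take(1):
    print(sequence_batch.shape)
    print(label_batch.shape)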

GloVe

GloVe is short for Global Vectors for Word Representation. It is an unsupervised algorithm that creates vector representations for words. They have a site where you can download pre-trained models or get the code and train one yourself. We're going to use one of their pre-trained models.

path = Path("~/models/glove/").expanduser()
url = "http://nlp.stanford.edu/data/glove.6B.zip"
ZipDownloader(url, path)()
Files exist, not downloading

The GloVe data is stored as a series of space-separated lines, with the first column being the word that's encoded and the rest of the columns being the values of its vector. To load it, we'll split the word off from the vector and build a dictionary mapping each word to its vector.

embeddings = {}
with TIMER:
    with open(path/"glove.6B.100d.txt") as lines:
        for line in lines:
            tokens = line.split()
            embeddings[tokens[0]] = numpy.array(tokens[1:])
2019-10-06 18:55:11,592 graeae.timers.timer start: Started: 2019-10-06 18:55:11.592880
I1006 18:55:11.592908 140055379531584 timer.py:70] Started: 2019-10-06 18:55:11.592880
2019-10-06 18:55:21,542 graeae.timers.timer end: Ended: 2019-10-06 18:55:21.542689
I1006 18:55:21.542738 140055379531584 timer.py:77] Ended: 2019-10-06 18:55:21.542689
2019-10-06 18:55:21,544 graeae.timers.timer end: Elapsed: 0:00:09.949809
I1006 18:55:21.544939 140055379531584 timer.py:78] Elapsed: 0:00:09.949809
print(f"{len(embeddings):,}")
400,000
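
As a quick peek at what got loaded (a sketch assuming the embeddings dictionary built above), each entry should be a vector with Text.embedding_dim (100) components; note that the values are stored as strings here and only get converted to floats when they're copied into the matrix below.

# Every vector in the 100d GloVe file should have 100 components.
print(len(embeddings["the"]))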

So the GloVe vocabulary consists of 400,000 "words" ("tokens" is more accurate, since they also include punctuation). The problem we have to deal with next is that our tweets weren't part of the corpus used to train the embeddings, so there will probably be some tokens in our data set that don't have GloVe vectors. To handle this we'll give those extra tokens all-zero vectors.

Rather than adding to the dict, we'll create a matrix of zeros with a row for each word in our dataset's vocabulary, then we'll iterate over the words in our vocabulary and, if there's a match in the GloVe embeddings, copy the vector into the matrix; any word without a match keeps its row of zeros.

with TIMER:
    embeddings_matrix = numpy.zeros((vocabulary_size + 1, Text.embedding_dim))
    for word, index in word_index.items():
        embedding_vector = embeddings.get(word)
        if embedding_vector is not None:
            embeddings_matrix[index] = embedding_vector
2019-10-06 18:55:46,577 graeae.timers.timer start: Started: 2019-10-06 18:55:46.577855
I1006 18:55:46.577886 140055379531584 timer.py:70] Started: 2019-10-06 18:55:46.577855
2019-10-06 18:55:51,374 graeae.timers.timer end: Ended: 2019-10-06 18:55:51.374706
I1006 18:55:51.374763 140055379531584 timer.py:77] Ended: 2019-10-06 18:55:51.374706
2019-10-06 18:55:51,377 graeae.timers.timer end: Elapsed: 0:00:04.796851
I1006 18:55:51.377207 140055379531584 timer.py:78] Elapsed: 0:00:04.796851
print(f"{len(embeddings_matrix):,}")
690,961
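
Since our tweets weren't part of the GloVe training corpus, it's worth checking the coverage (a sketch, assuming the word_index and embeddings from above); any token without a match keeps its all-zero row in the matrix.

# Count how many of our tokens actually have a pre-trained GloVe vector.
hits = sum(1 for word in word_index if word in embeddings)
print(f"{hits:,} of {vocabulary_size:,} tokens "
      f"({100 * hits/vocabulary_size:.2f}%) have GloVe vectors")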

The Models

A CNN

  • Build
    convoluted_model = tensorflow.keras.Sequential([
        tensorflow.keras.layers.Embedding(
            vocabulary_size + 1,
            Text.embedding_dim,
            input_length=Text.max_length,
            weights=[embeddings_matrix],
            trainable=False),
        tensorflow.keras.layers.Conv1D(filters=128,
                                       kernel_size=5,
                                       activation='relu'),
        tensorflow.keras.layers.GlobalMaxPooling1D(),
        tensorflow.keras.layers.Dense(24, activation='relu'),
        tensorflow.keras.layers.Dense(1, activation='sigmoid')
    ])
    convoluted_model.compile(loss="binary_crossentropy", optimizer="rmsprop",
                             metrics=["accuracy"])
    
    print(convoluted_model.summary())
    
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    embedding (Embedding)        (None, 16, 100)           69096100  
    _________________________________________________________________
    conv1d (Conv1D)              (None, 12, 128)           64128     
    _________________________________________________________________
    global_max_pooling1d (Global (None, 128)               0         
    _________________________________________________________________
    dense (Dense)                (None, 24)                3096      
    _________________________________________________________________
    dense_1 (Dense)              (None, 1)                 25        
    =================================================================
    Total params: 69,163,349
    Trainable params: 67,249
    Non-trainable params: 69,096,100
    _________________________________________________________________
    None
    
  • Train
    Training = Namespace(
        size = 0.75,
        epochs = 2,
        verbosity = 2,
        batch_size=128,
        )
    
    with TIMER:
        cnn_history = convoluted_model.fit(training_dataset,
                                           epochs=Training.epochs,
                                           validation_data=testing_dataset,
                                           verbose=Training.verbosity)
    
    2019-10-10 07:27:04,921 graeae.timers.timer start: Started: 2019-10-10 07:27:04.921617
    I1010 07:27:04.921657 140436771002176 timer.py:70] Started: 2019-10-10 07:27:04.921617
    Epoch 1/2
    W1010 07:27:05.154920 140436771002176 deprecation.py:323] From /home/hades/.virtualenvs/In-Too-Deep/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.where in 2.0, which has the same broadcast rule as np.where
    20000/20000 - 4964s - loss: 0.5091 - accuracy: 0.7454 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
    Epoch 2/2
    20000/20000 - 4935s - loss: 0.4790 - accuracy: 0.7671 - val_loss: 0.4782 - val_accuracy: 0.7677
    2019-10-10 10:12:04,382 graeae.timers.timer end: Ended: 2019-10-10 10:12:04.382359
    I1010 10:12:04.382491 140436771002176 timer.py:77] Ended: 2019-10-10 10:12:04.382359
    2019-10-10 10:12:04,384 graeae.timers.timer end: Elapsed: 2:44:59.460742
    I1010 10:12:04.384716 140436771002176 timer.py:78] Elapsed: 2:44:59.460742
    
  • Some Plotting
    performance = pandas.DataFrame(cnn_history.history)
    plot = performance.hvplot().opts(title="CNN Twitter Sentiment Training Performance",
                                     width=1000,
                                     height=800)
    Embed(plot=plot, file_name="cnn_training")()
    

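As a final usage sketch (assuming the tokenizer and the trained convoluted_model from above), here's how you could score a new tweet; the sigmoid output is a probability, so anything above 0.5 leans positive.

# Tokenize and pad a made-up tweet the same way as the training data,
# then ask the model for a sentiment score between 0 (negative) and 1 (positive).
sample = ["this is turning out to be a wonderful day"]
sample_sequence = tokenizer.texts_to_sequences(sample)
sample_padded = pad_sequences(sample_sequence, maxlen=Text.max_length,
                              truncating=Text.trunc_type)
print(convoluted_model.predict(sample_padded))
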
End

Citations

  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).