BBC News Classification
Table of Contents
Beginning
Imports
Python
from functools import partial
from pathlib import Path
import csv
import random
PyPi
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import hvplot.pandas
import numpy
import pandas
import spacy
import tensorflow
Graeae
from graeae import EmbedHoloviews, SubPathLoader, Timer
Setup
The Timer
TIMER = Timer()
The Environment
ENVIRONMENT = SubPathLoader('DATASETS')
Spacy
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_lg")
Plotting
SLUG = "bbc-news-classification"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")
Middle
Load the Datasets
path = Path(ENVIRONMENT["BBC_NEWS"]).expanduser()
texts = []
labels = []
with TIMER:
with path.open() as csvfile:
lines = csv.DictReader(csvfile)
for line in lines:
labels.append(line["category"])
texts.append(nlp(line["text"]))
2019-09-08 13:32:14,806 graeae.timers.timer start: Started: 2019-09-08 13:32:14.806861
2019-09-08 13:33:37,430 graeae.timers.timer end: Ended: 2019-09-08 13:33:37.430228
2019-09-08 13:33:37,431 graeae.timers.timer end: Elapsed: 0:01:22.623367
print(texts[random.randrange(len(texts))])
candidate resigns over bnp link a prospective candidate for the uk independence party (ukip) has resigned after admitting a brief attachment to the british national party(bnp). nicholas betts-green who had been selected to fight the suffolk coastal seat quit after reports in a newspaper that he attended a bnp meeting. the former teacher confirmed he had attended the meeting but said that was the only contact he had with the group. mr betts-green resigned after being questioned by the party s leadership. a ukip spokesman said mr betts-green s resignation followed disclosures in the east anglian daily times last month about his attendance at a bnp meeting. he did once attend a bnp meeting. he did not like what he saw and heard and will take no further part of it the spokesman added. a meeting of suffolk coastal ukip members is due to be held next week to discuss a replacement. mr betts-green of woodbridge suffolk has also resigned as ukip s branch chairman.
So, it looks like the text has been lower-cased but there's still punctuation and extra white-space.
print(f"Rows: {len(labels):,}")
print(f"Unique Labels: {len(set(labels)):,}")
Rows: 2,225
Unique Labels: 5
Since there are only five categories, it might help to plot their distribution.
labels_frame = pandas.DataFrame({"label": labels})
counts = labels_frame.label.value_counts().reset_index().rename(
columns={"index": "Category", "label": "Articles"})
plot = counts.hvplot.bar("Category", "Articles").opts(
title="Count of BBC News Articles by Category",
height=800, width=1000)
Embed(plot=plot, file_name="bbc_category_counts")()
It looks like the categories are somewhat unevenly distributed. Now to normalize the tokens.
with TIMER:
cleaned = [[token.lemma_ for token in text if not any((token.is_stop, token.is_space, token.is_punct))]
for text in texts]
2019-09-08 13:33:40,257 graeae.timers.timer start: Started: 2019-09-08 13:33:40.257908
2019-09-08 13:33:40,810 graeae.timers.timer end: Ended: 2019-09-08 13:33:40.810135
2019-09-08 13:33:40,811 graeae.timers.timer end: Elapsed: 0:00:00.552227
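As a sanity check on the cleaning, here's a peek at one of the cleaned documents (a minimal sketch; the exact lemmas depend on the dataset and the spaCy model):

# look at the first twenty lemmas of a random cleaned document
print(cleaned[random.randrange(len(cleaned))][:20])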
The Tokenizers
Even though I've already tokenized the texts with spaCy, we eventually need to encode them as integer sequences, so I'll use the tensorflow.keras Tokenizer.
Note: The labels tokenizer doesn't get the out-of-vocabulary token, only the text-tokenizer does.
tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
labels_tokenizer = Tokenizer()
labels_tokenizer.fit_on_texts(labels)
The num_words argument is the maximum number of words that will be kept in the word index - I don't know why a thousand, I just found that in the "answer" notebook. The oov_token is what's used when a word is encountered that isn't in the word-index we're building (Out Of Vocabulary). The next step is to create the word-index by fitting the tokenizer to the text.
with TIMER:
tokenizer.fit_on_texts(cleaned)
2019-09-08 14:59:30,671 graeae.timers.timer start: Started: 2019-09-08 14:59:30.671536
2019-09-08 14:59:30,862 graeae.timers.timer end: Ended: 2019-09-08 14:59:30.862483
2019-09-08 14:59:30,863 graeae.timers.timer end: Elapsed: 0:00:00.190947
The tokenizer now has a dictionary named word_index that holds the word:index pairs for all the tokens found (the num_words limit is only applied when you call the tokenizer's methods, according to Stack Overflow).
print(f"{len(tokenizer.word_index):,}")
24,339
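That's far more than the 1,000 words we're keeping. A toy example (my own, not part of the assignment) makes the num_words and oov_token behavior concrete:

# a throwaway tokenizer, unrelated to the BBC data
toy = Tokenizer(num_words=3, oov_token="<OOV>")
toy.fit_on_texts(["the cat sat", "the dog sat", "the bird flew"])
# word_index holds every word seen, regardless of num_words
print(toy.word_index)
# but texts_to_sequences maps anything outside the top words to the OOV index
print(toy.texts_to_sequences(["the cat sat on the dog"]))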
Making the Sequences
I've trained the Tokenizer so that it has a word-index, but now we have to convert our texts to sequences of those indices and pad them so they're all the same length.
MAX_LENGTH = 120
sequences = tokenizer.texts_to_sequences(cleaned)
padded = pad_sequences(sequences, padding="post", maxlen=MAX_LENGTH)
labels_sequenced = labels_tokenizer.texts_to_sequences(labels)
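A quick check that the padding did what we expect (a sketch using the arrays just built):

# every row should have MAX_LENGTH entries; shorter documents are
# zero-padded at the end, longer ones are truncated
print(padded.shape)
print(padded[0][-10:])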
Make training and testing sets
TESTING = 0.2
x_train, x_test, y_train, y_test = train_test_split(
padded, labels_sequenced,
test_size=TESTING)
x_train, x_validation, y_train, y_validation = train_test_split(
x_train, y_train, test_size=TESTING)
y_train = numpy.array(y_train)
y_test = numpy.array(y_test)
y_validation = numpy.array(y_validation)
print(f"Training: {x_train.shape}")
print(f"Validation: {x_validation.shape}")
print(f"Testing: {x_test.shape}")
Training: (1424, 120)
Validation: (356, 120)
Testing: (445, 120)
Note: I originally forgot to pass the TESTING variable with the test_size keyword and got an error about not being able to use a singleton array - don't forget the keywords when you pass anything other than the data to train_test_split.
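One more thing worth checking before building the model: the labels tokenizer starts its indices at 1 (0 is reserved for padding), so the five categories are encoded as the integers 1 through 5 - which is why the output layer below has six units rather than five. A quick check (a sketch using the arrays from above):

# the smallest label should be 1 and the largest 5, so a softmax
# over 6 outputs (with 0 unused) covers them all
print(y_train.min(), y_train.max())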
The Model
vocabulary_size = 1000
embedding_dimension = 16
max_length = 120
model = tensorflow.keras.Sequential([
layers.Embedding(vocabulary_size, embedding_dimension,
input_length=max_length),
layers.GlobalAveragePooling1D(),
layers.Dense(24, activation="relu"),
layers.Dense(6, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
print(model.summary())
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, 120, 16) 16000 _________________________________________________________________ global_average_pooling1d_1 ( (None, 16) 0 _________________________________________________________________ dense_2 (Dense) (None, 24) 408 _________________________________________________________________ dense_3 (Dense) (None, 6) 150 ================================================================= Total params: 16,558 Trainable params: 16,558 Non-trainable params: 0 _________________________________________________________________ None
model.fit(x_train, y_train, epochs=30,
validation_data=(x_validation, y_validation), verbose=2)
Train on 1424 samples, validate on 356 samples
Epoch 1/30 1424/1424 - 0s - loss: 1.7623 - accuracy: 0.2879 - val_loss: 1.7257 - val_accuracy: 0.5000
Epoch 2/30 1424/1424 - 0s - loss: 1.6871 - accuracy: 0.5190 - val_loss: 1.6332 - val_accuracy: 0.5281
Epoch 3/30 1424/1424 - 0s - loss: 1.5814 - accuracy: 0.4782 - val_loss: 1.5118 - val_accuracy: 0.4944
Epoch 4/30 1424/1424 - 0s - loss: 1.4417 - accuracy: 0.4677 - val_loss: 1.3543 - val_accuracy: 0.5365
Epoch 5/30 1424/1424 - 0s - loss: 1.2706 - accuracy: 0.5934 - val_loss: 1.1850 - val_accuracy: 0.7022
Epoch 6/30 1424/1424 - 0s - loss: 1.1075 - accuracy: 0.6749 - val_loss: 1.0387 - val_accuracy: 0.8006
Epoch 7/30 1424/1424 - 0s - loss: 0.9606 - accuracy: 0.8483 - val_loss: 0.9081 - val_accuracy: 0.8567
Epoch 8/30 1424/1424 - 0s - loss: 0.8244 - accuracy: 0.8869 - val_loss: 0.7893 - val_accuracy: 0.8848
Epoch 9/30 1424/1424 - 0s - loss: 0.6963 - accuracy: 0.9164 - val_loss: 0.6747 - val_accuracy: 0.8961
Epoch 10/30 1424/1424 - 0s - loss: 0.5815 - accuracy: 0.9228 - val_loss: 0.5767 - val_accuracy: 0.9185
Epoch 11/30 1424/1424 - 0s - loss: 0.4831 - accuracy: 0.9375 - val_loss: 0.4890 - val_accuracy: 0.9270
Epoch 12/30 1424/1424 - 0s - loss: 0.3991 - accuracy: 0.9473 - val_loss: 0.4195 - val_accuracy: 0.9326
Epoch 13/30 1424/1424 - 0s - loss: 0.3321 - accuracy: 0.9508 - val_loss: 0.3669 - val_accuracy: 0.9438
Epoch 14/30 1424/1424 - 0s - loss: 0.2800 - accuracy: 0.9572 - val_loss: 0.3268 - val_accuracy: 0.9494
Epoch 15/30 1424/1424 - 0s - loss: 0.2385 - accuracy: 0.9656 - val_loss: 0.2936 - val_accuracy: 0.9438
Epoch 16/30 1424/1424 - 0s - loss: 0.2053 - accuracy: 0.9740 - val_loss: 0.2693 - val_accuracy: 0.9466
Epoch 17/30 1424/1424 - 0s - loss: 0.1775 - accuracy: 0.9761 - val_loss: 0.2501 - val_accuracy: 0.9466
Epoch 18/30 1424/1424 - 0s - loss: 0.1557 - accuracy: 0.9789 - val_loss: 0.2332 - val_accuracy: 0.9494
Epoch 19/30 1424/1424 - 0s - loss: 0.1362 - accuracy: 0.9831 - val_loss: 0.2189 - val_accuracy: 0.9522
Epoch 20/30 1424/1424 - 0s - loss: 0.1209 - accuracy: 0.9853 - val_loss: 0.2082 - val_accuracy: 0.9551
Epoch 21/30 1424/1424 - 0s - loss: 0.1070 - accuracy: 0.9860 - val_loss: 0.1979 - val_accuracy: 0.9579
Epoch 22/30 1424/1424 - 0s - loss: 0.0952 - accuracy: 0.9888 - val_loss: 0.1897 - val_accuracy: 0.9551
Epoch 23/30 1424/1424 - 0s - loss: 0.0854 - accuracy: 0.9902 - val_loss: 0.1815 - val_accuracy: 0.9579
Epoch 24/30 1424/1424 - 0s - loss: 0.0765 - accuracy: 0.9916 - val_loss: 0.1761 - val_accuracy: 0.9522
Epoch 25/30 1424/1424 - 0s - loss: 0.0689 - accuracy: 0.9930 - val_loss: 0.1729 - val_accuracy: 0.9579
Epoch 26/30 1424/1424 - 0s - loss: 0.0618 - accuracy: 0.9951 - val_loss: 0.1680 - val_accuracy: 0.9551
Epoch 27/30 1424/1424 - 0s - loss: 0.0559 - accuracy: 0.9958 - val_loss: 0.1633 - val_accuracy: 0.9551
Epoch 28/30 1424/1424 - 0s - loss: 0.0505 - accuracy: 0.9958 - val_loss: 0.1594 - val_accuracy: 0.9579
Epoch 29/30 1424/1424 - 0s - loss: 0.0457 - accuracy: 0.9965 - val_loss: 0.1559 - val_accuracy: 0.9522
Epoch 30/30 1424/1424 - 0s - loss: 0.0416 - accuracy: 0.9972 - val_loss: 0.1544 - val_accuracy: 0.9551
It gets good surprisingly fast - it might be overfitting toward the end.
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Loss: {loss:.2f} Accuracy: {accuracy:.2f}")
Loss: 0.16 Accuracy: 0.95
It does pretty well, even on the test set.
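To see whether the errors cluster in a particular category, a confusion matrix would help - a minimal sketch, assuming sklearn's confusion_matrix (not imported above):

from sklearn.metrics import confusion_matrix

# rows are true categories, columns are predicted categories
predicted = model.predict(x_test).argmax(axis=1)
print(confusion_matrix(y_test.flatten(), predicted))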
Plotting the Performance
data = pandas.DataFrame(model.history.history)
plot = data.hvplot().opts(title="Training Performance", width=1000, height=800)
Embed(plot=plot, file_name="model_performance")()
Unlike with the image classification problems, the validation performance never quite matches the training performance (although it's quite good), probably because we aren't doing any kind of augmentation the way you tend to do with images.
End
Okay, so we seem to have a decent model, but is that really the end-game? No, we want to be able to predict what classification a new input should get.
index_to_label = {value:key for (key, value) in labels_tokenizer.word_index.items()}
def category(text: str) -> None:
    """Categorizes the text

    Args:
     text: text to categorize
    """
    # encode the raw text with the fitted tokenizer, then pad and predict
    sequences = tokenizer.texts_to_sequences([text])
    predictions = model.predict(pad_sequences(sequences, maxlen=MAX_LENGTH))
    print(f"Predicted Category: {index_to_label[predictions.argmax()]}")
    return
text = "crickets are nutritious and delicious but make for such a silly game"
category(text)
Predicted Category: sport
text = "i like butts that are big and round, something something like a xxx throw down, and so does the house of parliament"
category(text)
Predicted Category: sport
It kind of looks like it's biased toward sports.
text = "tv future hand viewer home theatre"
category(text)
Predicted Category: sport
Something isn't right here.
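One likely culprit: the model was trained on the lemmatized, stop-word-filtered cleaned texts, but category is being fed raw strings, so many of the words won't match the fitted vocabulary and will be mapped to <OOV>. Looking at the whole probability distribution instead of just the argmax might confirm that - a sketch reusing the tokenizer, model, and index_to_label from above:

def category_probabilities(text: str) -> None:
    """Print the predicted probability for each category."""
    sequences = tokenizer.texts_to_sequences([text])
    probabilities = model.predict(pad_sequences(sequences, maxlen=MAX_LENGTH))[0]
    for index, probability in enumerate(probabilities):
        print(f"{index_to_label.get(index, '<padding>')}: {probability:.2f}")
    return

category_probabilities("tv future hand viewer home theatre")

If the distribution turns out to be nearly uniform, the inputs are mostly <OOV> noise and "sport" is just the category the model happens to fall back on.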