BBC News Classification
Table of Contents
Beginning
Imports
Python
from functools import partial
from pathlib import Path
import csv
import random
PyPi
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import hvplot.pandas
import numpy
import pandas
import spacy
import tensorflow
Graeae
from graeae import EmbedHoloviews, SubPathLoader, Timer
Setup
The Timer
TIMER = Timer()
The Environment
ENVIRONMENT = SubPathLoader('DATASETS')
Spacy
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_lg")
Plotting
SLUG = "bbc-news-classification"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")
Middle
Load the Datasets
path = Path(ENVIRONMENT["BBC_NEWS"]).expanduser()
texts = []
labels = []
with TIMER:
with path.open() as csvfile:
lines = csv.DictReader(csvfile)
for line in lines:
labels.append(line["category"])
texts.append(nlp(line["text"]))
2019-09-08 13:32:14,806 graeae.timers.timer start: Started: 2019-09-08 13:32:14.806861
2019-09-08 13:33:37,430 graeae.timers.timer end: Ended: 2019-09-08 13:33:37.430228
2019-09-08 13:33:37,431 graeae.timers.timer end: Elapsed: 0:01:22.623367
print(texts[random.randrange(len(texts))])
candidate resigns over bnp link a prospective candidate for the uk independence party (ukip) has resigned after admitting a brief attachment to the british national party(bnp). nicholas betts-green who had been selected to fight the suffolk coastal seat quit after reports in a newspaper that he attended a bnp meeting. the former teacher confirmed he had attended the meeting but said that was the only contact he had with the group. mr betts-green resigned after being questioned by the party s leadership. a ukip spokesman said mr betts-green s resignation followed disclosures in the east anglian daily times last month about his attendance at a bnp meeting. he did once attend a bnp meeting. he did not like what he saw and heard and will take no further part of it the spokesman added. a meeting of suffolk coastal ukip members is due to be held next week to discuss a replacement. mr betts-green of woodbridge suffolk has also resigned as ukip s branch chairman.
So, it looks like the text has been lower-cased but there's still punctuation and extra white-space.
print(f"Rows: {len(labels):,}")
print(f"Unique Labels: {len(set(labels)):,}")
Rows: 2,225
Unique Labels: 5
Since there are only five categories, it might help to plot their distribution.
labels_frame = pandas.DataFrame({"label": labels})
counts = labels_frame.label.value_counts().reset_index().rename(
columns={"index": "Category", "label": "Articles"})
plot = counts.hvplot.bar("Category", "Articles").opts(
title="Count of BBC News Articles by Category",
height=800, width=1000)
Embed(plot=plot, file_name="bbc_category_counts")()
It looks like the categories are somewhat unevenly distributed. Now to normalize the tokens.
with TIMER:
cleaned = [[token.lemma_ for token in text if not any((token.is_stop, token.is_space, token.is_punct))]
for text in texts]
2019-09-08 13:33:40,257 graeae.timers.timer start: Started: 2019-09-08 13:33:40.257908
2019-09-08 13:33:40,810 graeae.timers.timer end: Ended: 2019-09-08 13:33:40.810135
2019-09-08 13:33:40,811 graeae.timers.timer end: Elapsed: 0:00:00.552227
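As a sanity check on the cleaning, here's a peek at one of the cleaned documents (a minimal sketch; the exact lemmas depend on the dataset and the spaCy model):

# look at the first twenty lemmas of a random cleaned document
print(cleaned[random.randrange(len(cleaned))][:20])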
The Tokenizers
Even though I've already tokenized the texts with spaCy, we eventually need to encode them as integer sequences, so I'll use the tensorflow.keras Tokenizer.
Note: The labels tokenizer doesn't get the out-of-vocabulary token, only the text-tokenizer does.
tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
labels_tokenizer = Tokenizer()
labels_tokenizer.fit_on_texts(labels)
The num_words argument is the maximum number of words that will be kept in the word index - I don't know why a thousand, I just found that in the "answer" notebook. The oov_token is what's used when a word is encountered that isn't in the word-index we're building (Out Of Vocabulary). The next step is to create the word-index by fitting the tokenizer to the text.
with TIMER:
tokenizer.fit_on_texts(cleaned)
2019-09-08 14:59:30,671 graeae.timers.timer start: Started: 2019-09-08 14:59:30.671536
2019-09-08 14:59:30,862 graeae.timers.timer end: Ended: 2019-09-08 14:59:30.862483
2019-09-08 14:59:30,863 graeae.timers.timer end: Elapsed: 0:00:00.190947
The tokenizer now has a dictionary named word_index that holds the word:index pairs for all the tokens found (the num_words limit is only applied when you call the tokenizer's methods, according to Stack Overflow).
print(f"{len(tokenizer.word_index):,}")
24,339
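That's far more than the 1,000 words we're keeping. A toy example (my own, not part of the assignment) makes the num_words and oov_token behavior concrete:

# a throwaway tokenizer, unrelated to the BBC data
toy = Tokenizer(num_words=3, oov_token="<OOV>")
toy.fit_on_texts(["the cat sat", "the dog sat", "the bird flew"])
# word_index holds every word seen, regardless of num_words
print(toy.word_index)
# but texts_to_sequences maps anything outside the top words to the OOV index
print(toy.texts_to_sequences(["the cat sat on the dog"]))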
Making the Sequences
I've trained the Tokenizer so that it has a word-index, but now we have to convert our texts to sequences of those indices and pad them so they're all the same length.
MAX_LENGTH = 120
sequences = tokenizer.texts_to_sequences(cleaned)
padded = pad_sequences(sequences, padding="post", maxlen=MAX_LENGTH)
labels_sequenced = labels_tokenizer.texts_to_sequences(labels)
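A quick check that the padding did what we expect (a sketch using the arrays just built):

# every row should have MAX_LENGTH entries; shorter documents are
# zero-padded at the end, longer ones are truncated
print(padded.shape)
print(padded[0][-10:])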
Make training and testing sets
TESTING = 0.2
x_train, x_test, y_train, y_test = train_test_split(
padded, labels_sequenced,
test_size=TESTING)
x_train, x_validation, y_train, y_validation = train_test_split(
x_train, y_train, test_size=TESTING)
y_train = numpy.array(y_train)
y_test = numpy.array(y_test)
y_validation = numpy.array(y_validation)
print(f"Training: {x_train.shape}")
print(f"Validation: {x_validation.shape}")
print(f"Testing: {x_test.shape}")
Training: (1424, 120)
Validation: (356, 120)
Testing: (445, 120)
Note: I originally forgot to pass the TESTING variable with the test_size keyword and got an error about not being able to use a singleton array - don't forget the keywords when you pass anything other than the data to train_test_split.
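One more thing worth checking before building the model: the labels tokenizer starts its indices at 1 (0 is reserved for padding), so the five categories are encoded as the integers 1 through 5 - which is why the output layer below has six units rather than five. A quick check (a sketch using the arrays from above):

# the smallest label should be 1 and the largest 5, so a softmax
# over 6 outputs (with 0 unused) covers them all
print(y_train.min(), y_train.max())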
The Model
vocabulary_size = 1000
embedding_dimension = 16
max_length = 120
model = tensorflow.keras.Sequential([
layers.Embedding(vocabulary_size, embedding_dimension,
input_length=max_length),
layers.GlobalAveragePooling1D(),
layers.Dense(24, activation="relu"),
layers.Dense(6, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
print(model.summary())
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, 120, 16) 16000 _________________________________________________________________ global_average_pooling1d_1 ( (None, 16) 0 _________________________________________________________________ dense_2 (Dense) (None, 24) 408 _________________________________________________________________ dense_3 (Dense) (None, 6) 150 ================================================================= Total params: 16,558 Trainable params: 16,558 Non-trainable params: 0 _________________________________________________________________ None
model.fit(x_train, y_train, epochs=30,
validation_data=(x_validation, y_validation), verbose=2)
Train on 1424 samples, validate on 356 samples
Epoch 1/30 1424/1424 - 0s - loss: 1.7623 - accuracy: 0.2879 - val_loss: 1.7257 - val_accuracy: 0.5000
Epoch 2/30 1424/1424 - 0s - loss: 1.6871 - accuracy: 0.5190 - val_loss: 1.6332 - val_accuracy: 0.5281
Epoch 3/30 1424/1424 - 0s - loss: 1.5814 - accuracy: 0.4782 - val_loss: 1.5118 - val_accuracy: 0.4944
Epoch 4/30 1424/1424 - 0s - loss: 1.4417 - accuracy: 0.4677 - val_loss: 1.3543 - val_accuracy: 0.5365
Epoch 5/30 1424/1424 - 0s - loss: 1.2706 - accuracy: 0.5934 - val_loss: 1.1850 - val_accuracy: 0.7022
Epoch 6/30 1424/1424 - 0s - loss: 1.1075 - accuracy: 0.6749 - val_loss: 1.0387 - val_accuracy: 0.8006
Epoch 7/30 1424/1424 - 0s - loss: 0.9606 - accuracy: 0.8483 - val_loss: 0.9081 - val_accuracy: 0.8567
Epoch 8/30 1424/1424 - 0s - loss: 0.8244 - accuracy: 0.8869 - val_loss: 0.7893 - val_accuracy: 0.8848
Epoch 9/30 1424/1424 - 0s - loss: 0.6963 - accuracy: 0.9164 - val_loss: 0.6747 - val_accuracy: 0.8961
Epoch 10/30 1424/1424 - 0s - loss: 0.5815 - accuracy: 0.9228 - val_loss: 0.5767 - val_accuracy: 0.9185
Epoch 11/30 1424/1424 - 0s - loss: 0.4831 - accuracy: 0.9375 - val_loss: 0.4890 - val_accuracy: 0.9270
Epoch 12/30 1424/1424 - 0s - loss: 0.3991 - accuracy: 0.9473 - val_loss: 0.4195 - val_accuracy: 0.9326
Epoch 13/30 1424/1424 - 0s - loss: 0.3321 - accuracy: 0.9508 - val_loss: 0.3669 - val_accuracy: 0.9438
Epoch 14/30 1424/1424 - 0s - loss: 0.2800 - accuracy: 0.9572 - val_loss: 0.3268 - val_accuracy: 0.9494
Epoch 15/30 1424/1424 - 0s - loss: 0.2385 - accuracy: 0.9656 - val_loss: 0.2936 - val_accuracy: 0.9438
Epoch 16/30 1424/1424 - 0s - loss: 0.2053 - accuracy: 0.9740 - val_loss: 0.2693 - val_accuracy: 0.9466
Epoch 17/30 1424/1424 - 0s - loss: 0.1775 - accuracy: 0.9761 - val_loss: 0.2501 - val_accuracy: 0.9466
Epoch 18/30 1424/1424 - 0s - loss: 0.1557 - accuracy: 0.9789 - val_loss: 0.2332 - val_accuracy: 0.9494
Epoch 19/30 1424/1424 - 0s - loss: 0.1362 - accuracy: 0.9831 - val_loss: 0.2189 - val_accuracy: 0.9522
Epoch 20/30 1424/1424 - 0s - loss: 0.1209 - accuracy: 0.9853 - val_loss: 0.2082 - val_accuracy: 0.9551
Epoch 21/30 1424/1424 - 0s - loss: 0.1070 - accuracy: 0.9860 - val_loss: 0.1979 - val_accuracy: 0.9579
Epoch 22/30 1424/1424 - 0s - loss: 0.0952 - accuracy: 0.9888 - val_loss: 0.1897 - val_accuracy: 0.9551
Epoch 23/30 1424/1424 - 0s - loss: 0.0854 - accuracy: 0.9902 - val_loss: 0.1815 - val_accuracy: 0.9579
Epoch 24/30 1424/1424 - 0s - loss: 0.0765 - accuracy: 0.9916 - val_loss: 0.1761 - val_accuracy: 0.9522
Epoch 25/30 1424/1424 - 0s - loss: 0.0689 - accuracy: 0.9930 - val_loss: 0.1729 - val_accuracy: 0.9579
Epoch 26/30 1424/1424 - 0s - loss: 0.0618 - accuracy: 0.9951 - val_loss: 0.1680 - val_accuracy: 0.9551
Epoch 27/30 1424/1424 - 0s - loss: 0.0559 - accuracy: 0.9958 - val_loss: 0.1633 - val_accuracy: 0.9551
Epoch 28/30 1424/1424 - 0s - loss: 0.0505 - accuracy: 0.9958 - val_loss: 0.1594 - val_accuracy: 0.9579
Epoch 29/30 1424/1424 - 0s - loss: 0.0457 - accuracy: 0.9965 - val_loss: 0.1559 - val_accuracy: 0.9522
Epoch 30/30 1424/1424 - 0s - loss: 0.0416 - accuracy: 0.9972 - val_loss: 0.1544 - val_accuracy: 0.9551
It gets good surprisingly fast - it might be overfitting toward the end.
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Loss: {loss:.2f} Accuracy: {accuracy:.2f}")
Loss: 0.16 Accuracy: 0.95
It does pretty well, even on the test set.
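To see whether the errors cluster in a particular category, a confusion matrix would help - a minimal sketch, assuming sklearn's confusion_matrix (not imported above):

from sklearn.metrics import confusion_matrix

# rows are true categories, columns are predicted categories
predicted = model.predict(x_test).argmax(axis=1)
print(confusion_matrix(y_test.flatten(), predicted))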
Plotting the Performance
data = pandas.DataFrame(model.history.history)
plot = data.hvplot().opts(title="Training Performance", width=1000, height=800)
Embed(plot=plot, file_name="model_performance")()
Unlike with the image classification problems, the validation performance never quite matches the training performance (although it's quite good), probably because we aren't doing any kind of augmentation the way you tend to do with images.
End
Okay, so we seem to have a decent model, but is that really the end-game? No, we want to be able to predict what classification a new input should get.
index_to_label = {value:key for (key, value) in labels_tokenizer.word_index.items()}
def category(text: str) -> None:
    """Categorizes the text

    Args:
     text: text to categorize
    """
    # encode the raw text with the fitted tokenizer, then pad and predict
    sequences = tokenizer.texts_to_sequences([text])
    predictions = model.predict(pad_sequences(sequences, maxlen=MAX_LENGTH))
    print(f"Predicted Category: {index_to_label[predictions.argmax()]}")
    return
text = "crickets are nutritious and delicious but make for such a silly game"
category(text)
Predicted Category: sport
text = "i like butts that are big and round, something something like a xxx throw down, and so does the house of parliament"
category(text)
Predicted Category: sport
It kind of looks like it's biased toward sports.
text = "tv future hand viewer home theatre"
category(text)
Predicted Category: sport
Something isn't right here.
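One likely culprit: the model was trained on the lemmatized, stop-word-filtered cleaned texts, but category is being fed raw strings, so many of the words won't match the fitted vocabulary and will be mapped to <OOV>. Looking at the whole probability distribution instead of just the argmax might confirm that - a sketch reusing the tokenizer, model, and index_to_label from above:

def category_probabilities(text: str) -> None:
    """Print the predicted probability for each category."""
    sequences = tokenizer.texts_to_sequences([text])
    probabilities = model.predict(pad_sequences(sequences, maxlen=MAX_LENGTH))[0]
    for index, probability in enumerate(probabilities):
        print(f"{index_to_label.get(index, '<padding>')}: {probability:.2f}")
    return

category_probabilities("tv future hand viewer home theatre")

If the distribution turns out to be nearly uniform, the inputs are mostly <OOV> noise and "sport" is just the category the model happens to fall back on.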