Cleaning the BBC News Archive

Beginning

This is an initial look at cleaning up a text dataset from the BBC News archives. Although the exercise cites this as the source, the dataset provided doesn't look like the actual raw dataset, which is broken up into folders that classify the contents, with each news item in a separate file. Instead we're starting with a partially pre-processed CSV that has been lower-cased and has the classification as the first column.

Imports

Python

from pathlib import Path

PyPI

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas

Graeae

from graeae import SubPathLoader, Timer

Set Up

The Environment

ENVIRONMENT = SubPathLoader("DATASETS")

The Timer

TIMER = Timer()

Middle

The DataSet

bbc_path = Path(ENVIRONMENT["BBC_NEWS"]).expanduser()
with TIMER:
    data = pandas.read_csv(bbc_path/"bbc-text.csv")
2019-08-25 18:51:38,411 graeae.timers.timer start: Started: 2019-08-25 18:51:38.411196
2019-08-25 18:51:38,658 graeae.timers.timer end: Ended: 2019-08-25 18:51:38.658181
2019-08-25 18:51:38,658 graeae.timers.timer end: Elapsed: 0:00:00.246985
print(data.shape)
(2225, 2)
print(data.sample().iloc[0])
category                                                sport
text        bell set for england debut bath prop duncan be...
Name: 2134, dtype: object

So we have two columns - category and text, text being the one we have to clean up.

print(data.text.dtype)
object

That's not such an informative answer, but I checked and each row of text is a single string.
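A check along these lines is enough to confirm it (just a sketch against the data frame loaded above):

# Every entry of the text column should be a plain python string.
print(data.text.map(type).value_counts())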

The Tokenizer

The Keras Tokenizer tokenizes the text for us, as well as removing the punctuation, lower-casing the text, and a few other things. We're also going to use an Out-of-Vocabulary token of "<OOV>" to identify words that are outside of the vocabulary when converting new texts to sequences.

tokenizer = Tokenizer(oov_token="<OOV>", num_words=100)
tokenizer.fit_on_texts(data.text)
word_index = tokenizer.word_index
print(len(word_index))
29727

The word_index is a dict that maps each word found in the documents to an integer index, ranked by frequency (1 is the most common). Note that it keeps every word regardless of the num_words=100 argument - that limit is only applied when converting texts to sequences.
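For a concrete look, here are the first few entries of the index and of the companion word_counts mapping, which does hold the raw counts (a sketch, assuming the tokenizer fit above):

# word_index: word -> frequency rank; word_counts: word -> raw count.
print(list(word_index.items())[:5])
print(list(tokenizer.word_counts.items())[:5])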

Convert the Texts To Sequences

We're going to convert each of our texts to a sequence of integers, one for each word, using the indices from the word_index. The pad_sequences function adds zeros to the end of sequences that are shorter than the longest one (because of the padding="post" argument) so that they are all the same length.

sequences = tokenizer.texts_to_sequences(data.text)
padded = pad_sequences(sequences, padding="post")
print(padded[0])
print(padded.shape)
[1 1 7 ... 0 0 0]
(2225, 4491)
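As a sanity check, the sequences can be turned back into (approximate) text with sequences_to_texts. Since num_words=100, only the hundred most common words survive and everything else comes back as "<OOV>" (a sketch, not part of the original clean-up):

# Decode the first document's sequence back into words; rare words
# show up as the out-of-vocabulary token.
print(tokenizer.sequences_to_texts(sequences[:1])[0][:100])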

Strangely, there doesn't appear to be a built-in way to remove stopwords with the Keras Tokenizer (see the sketch after the CountVectorizer code below for one workaround). Maybe sklearn is more appropriate here.

vectorizer = CountVectorizer(stop_words=stopwords.words("english"),
                             lowercase=True, min_df=3,
                             max_df=0.9, max_features=5000)
vectors = vectorizer.fit_transform(data.text)
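For completeness, if you did want stopword removal with the Keras Tokenizer, one workaround is to strip the stopwords from the text before fitting it. This is a rough sketch (the names cleaned and stopped_tokenizer are mine, not part of the exercise):

# Drop NLTK's english stopwords from each document before tokenizing.
STOPWORDS = set(stopwords.words("english"))
cleaned = data.text.apply(
    lambda text: " ".join(word for word in text.split()
                          if word not in STOPWORDS))
stopped_tokenizer = Tokenizer(oov_token="<OOV>", num_words=100)
stopped_tokenizer.fit_on_texts(cleaned)
print(len(stopped_tokenizer.word_index))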

End

Sources

The Original Dataset

  • D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.