Topic Modeling With Matrix Decomposition

Beginning

This is part of a walk-through of the fastai Code-First Introduction to NLP. In this post I'll be using Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF) to group newsgroup posts. Both of these methods are statistical approaches that use the word-counts within documents to decide how similar they are (while ignoring things like word order).

Imports

Python

from functools import partial
import random

PyPi

from scipy import linalg
from sklearn import decomposition
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import hvplot.pandas
import matplotlib.pyplot as pyplot
import numpy
import pandas

Others

from graeae import EmbedHoloviews, Timer

Set Up

The Timer

TIMER = Timer()

Plotting

Embed = partial(
    EmbedHoloviews,
    folder_path="../../files/posts/fastai/topic-modeling-with-matrix-decomposition")

Middle

The Dataset

The dataset consists of ~18,000 newsgroup posts covering 20 topics. To keep the computation down I'll only use a subset of the categories, and I'll strip out the headers, footers, and quoted replies so that only the body of each post remains.

keep = ("alt.atheism", "comp.graphics", "misc.forsale", "sci.crypt", "talk.politics.guns")
remove = ("headers", "footers", "quotes")
training = fetch_20newsgroups(subset="train", categories=keep, remove=remove)
testing = fetch_20newsgroups(subset="test", categories=keep, remove=remove)

I've run this more than once so there's no output here, but the first time you call the fetch_20newsgroups function it downloads the dataset and prints some output saying so.

print(f"{training.filenames.shape[0]:,}")
print(f"{training.target.shape[0]:,}")
2,790
2,790

So, although the entire dataset has over 18,000 entries, our subset has fewer than 3,000.

print(numpy.unique(training.target))
[0 1 2 3 4]

So the original category indices aren't preserved (the five I kept presumably weren't the first five in the full set, yet the labels run from 0 through 4), which means you have to check the mapping anytime you pull a subset out of the data.
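One way to check is the target_names attribute of the bunch that fetch_20newsgroups returns - the integer labels are indices into this list, and in our case the five kept categories come back in alphabetical order.

for index, name in enumerate(training.target_names):
    print(f"{index}: {name}")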

Let's see what one of the posts looks like.

print(random.choice(training.data))
      Just a question. 
      As a provider of a public BBS service - aren't you bound by law to gurantee
      intelligble access to the data of the users on the BBS, if police comes
      with sufficent authorisation ? I guessed this would be  a basic condition
      for such systems. (I did run a bbs some time ago, but that was in Switzerland)

The US doesn't yet have many laws covering BBSs - they're not common carriers,
they're not phone companies, they're just private machines or services
operated by businesses.  There's no obligation to keep records.
As Perry Metzger points out, if the police come with a search warrant,
you have to let them see what the warrant demands, if it exists,
and they generally can confiscate the equipment as "evidence"
(which is not Constitutionally valid, but we're only beginning to
develop court cases supporting us).  A court MAY be able to compel
you to tell them information you know, such as the encryption password
for the disk - there aren't any definitive cases yet, since it's a new
situation, and there probably aren't laws specifically covering it.
But the court can't force you to *know* the keys, and there are no
laws preventing you from allowing your users to have their own keys
for their own files without giving them to you.

Even in areas that do have established law, there is uncertainty.
There was a guy in Idaho a few years ago who had his business records
subpoenaed as evidence for taxes or some other business-restriction law,
so he gave the court the records.  Which were in Hebrew.
The US doesn't have laws forcing you to keep your records in English,
and these were the originals of the records.  HE didn't speak Hebrew,
and neither did anybody in the court organization.  Don't think they
were able to do much about it.

It might be illegal for your BBS to deny access to potential customers
based on race, religion, national origin, gender, or sexual preference;
it probably hasn't been tested in court, but it seems like a plausible
extension of anti-discrimination laws affecting other businesses.

Vectorizing

Here we'll convert the text to a matrix using sklearn's CountVectorizer. Interestingly, the Introduction to Information Retrieval book says that the trend has been toward not removing the most common words (stop words), but we'll be dropping them anyway. There's a paper called Stop Word Lists in Free Open-source Software Packages which points out some problems with stop-word lists in general and with sklearn's list in particular. I don't know if sklearn has done anything to address those concerns since the paper came out, but the sklearn documentation links to the paper, so I'd assume the problems are still there. Nonetheless, the fastai examples use them, so I will too.
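If you're curious, sklearn's stop-word list is importable, so you can see exactly what will get dropped:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# the frozenset of words removed when stop_words="english" is passed
print(f"{len(ENGLISH_STOP_WORDS)} stop words")
print(sorted(ENGLISH_STOP_WORDS)[:10])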

vectorizer = CountVectorizer(stop_words="english")

The SVD function we're going to use doesn't accept the sparse matrices that the vectorizer outputs by default, so we'll convert the output to a dense matrix after it's fit.

with TIMER:
    vectors = vectorizer.fit_transform(training.data).todense()
2020-01-01 16:26:48,048 graeae.timers.timer start: Started: 2020-01-01 16:26:48.047927
2020-01-01 16:26:48,466 graeae.timers.timer end: Ended: 2020-01-01 16:26:48.466285
2020-01-01 16:26:48,466 graeae.timers.timer end: Elapsed: 0:00:00.418358

That was much quicker than I thought it would be, probably because our dataset is so small.

vocabulary = vectorizer.get_feature_names()
print(f"{len(vocabulary):,}")
34,632

So our "vocabulary" is around 35,000 tokens.
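The tokens come back sorted, and since digits sort before letters, peeking at the start of the list shows the purely numeric "words" that made it through - a hint that some of the topics we find later may be polluted by them:

print(vocabulary[:5])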

Singular Value Decomposition (SVD)

Singular Value Decomposition is a linear algebra method that factors a matrix into three pieces: U (which here maps documents to topics), a diagonal matrix of singular values s (roughly, the importance of each topic), and V (which maps topics to terms). The deeper math is beyond me at this point, so I'll mostly treat it as a black box.

with TIMER:
    U, s, V = linalg.svd(vectors, full_matrices=False)
2020-01-01 16:26:50,508 graeae.timers.timer start: Started: 2020-01-01 16:26:50.508003
2020-01-01 16:27:23,979 graeae.timers.timer end: Ended: 2020-01-01 16:27:23.978988
2020-01-01 16:27:23,980 graeae.timers.timer end: Elapsed: 0:00:33.470985
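Even as a black box we can sanity-check the factorization: multiplying the three pieces back together should reproduce the original matrix (up to floating-point error), and the rows of V should be orthonormal. Both of these should print True.

# rebuild the original matrix from the factors
reconstructed = U @ numpy.diag(s) @ V
print(numpy.allclose(reconstructed, vectors))

# the topic vectors (rows of V) are orthonormal
print(numpy.allclose(V @ V.T, numpy.eye(V.shape[0])))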
s_frame = pandas.Series(s)
plot = s_frame.hvplot().opts(title="Diagonal Matrix S", width=1000, height=800)
Embed(plot=plot, file_name="s_values")()

[Figure: "Diagonal Matrix S" - a line plot of the singular values, which are sorted in descending order]

Looking At Some Topics

top_words_count = 8

# numpy.argsort sorts ascending, so grab the last top_words_count
# indices in reverse to get the most heavily weighted words
def top_words(topic):
    return [vocabulary[index]
            for index in numpy.argsort(topic)[: -top_words_count - 1: -1]]

# build a space-separated string of the top words for each topic (row)
def show_topics(array):
    topic_words = [top_words(topic) for topic in array]
    return [' '.join(topic) for topic in topic_words]
topics = show_topics(V[:10])
for index, topic in enumerate(topics):
    print(f"{index}: {topic}")
0: propagandist heliocentric galacticentric surname sandvik 400included wovy imaginative
1: file jpeg image edu pub ftp use graphics
2: file gun congress firearms control mr states rkba
3: privacy internet anonymous pub email information eff mail
4: graphics edu 128 3d ray pub data ftp
5: 00 50 40 appears dos 10 art 25
6: privacy internet 00 jpeg eff pub email electronic
7: key data image encryption des chip available law
8: pub key jesus jpeg eff graphics encryption ripem
9: key encryption edu des anonymous posting chip graphics

So what we're showing are the most significant words for each of the first ten topics (the ones with the largest singular values). It takes a little interpretation to map them to the newsgroups we used, and the texts probably could have used some cleaning up (entry 5 looks suspect), but it's interesting that this linear algebra decomposition found similar groups without any prompting as to what groups might even exist in the first place (this is an unsupervised method, not a supervised one).
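As a rough check on that interpretation, we can assign each post to whichever of these ten topics it loads on most heavily (using the magnitude of its row in U, since SVD weights can be negative) and cross-tabulate that against the actual newsgroup labels. This is just a heuristic I'm adding, not part of the fastai notebook.

# each row of U holds a document's weight on every topic
dominant_topic = numpy.absolute(U[:, :10]).argmax(axis=1)
print(pandas.crosstab(training.target, dominant_topic,
                      rownames=["newsgroup"], colnames=["topic"]))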

Non-negative Matrix Factorization (NMF)

NMF is another way to factor the document-term matrix, this time into just two matrices (one of document-to-topic weights and one of topic-to-term weights), with the constraint that no entry can be negative. That constraint tends to make the topics easier to interpret, but unlike SVD you have to pick the number of topics up front.

number_of_topics = 5
classifier = decomposition.NMF(n_components=number_of_topics, random_state=1)
weights = classifier.fit_transform(vectors)
classified = classifier.components_
for index, topic in enumerate(show_topics(classified)):
    print(f"{index}: {topic}")
0: db mov bh si cs byte al bl
1: privacy internet anonymous information email eff use pub
2: file gun congress control firearms states mr united
3: jpeg image gif file color images format quality
4: edu graphics pub image data ftp mail available
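Because the entries of weights are all non-negative, each row reads directly as how strongly a post belongs to each of the five topics, and argmax picks the strongest one. Here's a spot-check of a random post against its actual label.

# pick a random post and compare its strongest topic to its newsgroup
sample = random.randrange(len(training.data))
print(f"topic weights: {weights[sample].round(3)}")
print(f"strongest topic: {weights[sample].argmax()}")
print(f"actual newsgroup: {training.target_names[training.target[sample]]}")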

Term-Frequency/Inverse Document Frequency

Raw counts give a lot of weight to words simply because they're common. TF-IDF re-weights each count by how rare the word is across the collection, so words that show up everywhere get discounted. Here we'll re-fit the same NMF model using TF-IDF vectors instead of raw counts.

tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf_vectors = tfidf_vectorizer.fit_transform(training.data)
weights = classifier.fit_transform(tfidf_vectors)
classified = classifier.components_

for index, topic in enumerate(show_topics(classified)):
    print(f"{index}: {topic}")
0: people gun don think just guns right government
1: 00 sale offer shipping new drive price condition
2: key chip encryption clipper keys escrow government algorithm
3: graphics thanks file files image program know windows
4: god atheism believe does atheists belief said exist
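As a final note, the fitted vectorizer and NMF model can also be applied to the held-out testing split (which we downloaded earlier but haven't used). Calling transform rather than fit_transform re-uses what was learned from the training data, giving each test post a row of weights over the same five topics.

# vectorize the test posts and project them onto the learned topics
test_vectors = tfidf_vectorizer.transform(testing.data)
test_weights = classifier.transform(test_vectors)
print(test_weights.shape)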