pyLDAvis in org-mode

Imports

Python

from datetime import datetime
from pathlib import Path

From PyPI

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn

The Data

path = Path("~/datasets/newsgroups/").expanduser()
newsgroups = fetch_20newsgroups(data_home=path, subset="train")
print(path)
/home/brunhilde/datasets/newsgroups

The newsgroups.data is a list, so it doesn't have a shape attribute like it would if it were a numpy array.

print("{:,}".format(len(newsgroups.data)))
11,314

The documentation for the fetch_20newsgroups function says that the full dataset has 18,000 entries, so we have about 63% of the full set.
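The subset argument is what controls this; per the scikit-learn documentation it also accepts "test" and "all", so grabbing everything would look something like this (a sketch, not run here):

# "all" pulls the train and test splits together (roughly 18,000 posts)
everything = fetch_20newsgroups(data_home=path, subset="all")
print("{:,}".format(len(everything.data)))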

The Vectorizer

I'm going to use sklearn's CountVectorizer to convert the documents to arrays of token counts. This is about the visualization, not making an accurate model, so I'm going to use the built-in tokenizer. The fit method just learns the vocabulary; it's the fit_transform method that returns the document-term matrix we need (each row represents a document, the columns are the tokens, and the cells hold the counts for each token in the document).

started = datetime.now()
vectorizer = CountVectorizer(stop_words="english")
document_term_matrix = vectorizer.fit_transform(newsgroups.data)
print("Elapsed: {}".format(datetime.now() - started))
Elapsed: 0:00:02.798860

That was pretty fast; I guess this data set is sort of small.
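As a quick sanity check on the structure described above, here's a minimal sketch inspecting the matrix (fit_transform returns a scipy sparse matrix, so this uses its shape and nnz attributes; not run here):

# rows are documents, columns are vocabulary tokens
rows, columns = document_term_matrix.shape
print(document_term_matrix.shape)
# nnz counts the explicitly stored (non-zero) cells in the sparse matrix
print("{:.2%} of the cells are non-zero".format(
    document_term_matrix.nnz / (rows * columns)))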

The LDA

Now we'll build the Latent Dirichlet Allocation model.

start = datetime.now()
# one topic per newsgroup category
topics = len(newsgroups.target_names)
lda = LatentDirichletAllocation(n_components=topics)
lda.fit(document_term_matrix)
print("Elapsed: {}".format(datetime.now() - start))
Elapsed: 0:02:30.557142
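Before trusting the visualization, it can help to eyeball what the model actually learned. Here's a rough sketch that prints the top ten tokens for each topic using the model's components_ array (on newer scikit-learn versions, get_feature_names is renamed get_feature_names_out):

words = vectorizer.get_feature_names()
for index, topic in enumerate(lda.components_):
    # the biggest weights in each row of components_ are that topic's top tokens
    top_tokens = topic.argsort()[-10:][::-1]
    print("Topic {}: {}".format(index, " ".join(words[i] for i in top_tokens)))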

PyLDAvis

Okay, so here's where we try to get pyLDAvis into this thing.

Prepare the Data for the Visualization

The Prepared Data

start = datetime.now()
prepared_data = pyLDAvis.sklearn.prepare(lda, document_term_matrix, vectorizer)
print("Elapsed: {}".format(datetime.now() - start))

Elapsed: 0:00:33.152028
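As an aside, pyLDAvis can also write the visualization to a standalone page with save_html, which avoids embedding the large HTML at all (the file name here is just a placeholder):

# writes a self-contained HTML page you can link to instead of embedding
pyLDAvis.save_html(prepared_data, "newsgroups-lda.html")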

Build the HTML

The HTML that creates the plot is fairly large. The browser seems to handle it okay, but emacs gets noticeably slower. I'll try the simple template to see if that makes any difference (the default works in both jupyter notebooks and any other HTML, but simple won't work in jupyter notebooks). I'm also going to set the ID because the CSS doesn't work so well with mine, so I'm going to try to override the font-size on the header.

div_id = "pyldavis-in-org-mode"
html = pyLDAvis.prepared_data_to_html(prepared_data,
                                      template_type="simple",
                                      visid=div_id)
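For comparison, in a jupyter notebook you could skip the template juggling entirely and use pyLDAvis's notebook helpers (a sketch, not run here):

# renders the visualization inline in a notebook cell
pyLDAvis.enable_notebook()
pyLDAvis.display(prepared_data)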

Embed the HTML

print('''#+BEGIN_EXPORT html
{}
<script>
document.querySelector("div#{}-top").style.fontSize="large"
</script>
#+END_EXPORT'''.format(html, div_id))