pyLDAvis in org-mode



from datetime import datetime
from pathlib import Path

From PyPI

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn

The Data

path = Path("~/datasets/newsgroups/").expanduser()
newsgroups = fetch_20newsgroups(data_home=path, subset="train")

The data is a list, so it doesn't have a shape attribute like it would if it were a numpy array.


The documentation for the fetch_20newsgroups function says that the full dataset has 18,000 entries, so we have about 63% of the full set.
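Here's a quick sketch of the list-versus-array point and the percentage. The list is a tiny stand-in for the real data, and the count of 11,314 training posts is my assumption about what the training subset holds; it isn't printed anywhere in this post.

```python
# Stand-in for newsgroups.data, which is a plain list of strings.
documents = ["first post", "second post", "third post"]

# Lists don't have a shape attribute, so use len() to count the documents.
has_shape = hasattr(documents, "shape")
document_count = len(documents)
print(has_shape, document_count)

# Assumed size of the training subset (not shown in this post).
TRAINING_POSTS = 11314
# Approximate full size quoted by the fetch_20newsgroups documentation.
FULL_SET = 18000
fraction = TRAINING_POSTS / FULL_SET
print("{:.0%}".format(fraction))
```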

The Vectorizer

I'm going to use sklearn's CountVectorizer to convert the documents to arrays of token counts. This is about the visualization, not about making an accurate model, so I'm going to use the built-in tokenizer. The fit method only learns the vocabulary, while the fit_transform method learns it and also returns the document-term matrix that we need (each row represents a document, each column a token, and each cell holds the count of that token in that document).
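To make the document-term matrix concrete, here's a toy version in plain Python. It's only a sketch: the fit and transform functions are hypothetical helpers that mirror the method names, the tokenizer is a bare whitespace split, and the result is a list of lists rather than the sparse matrix that CountVectorizer returns.

```python
from collections import Counter


def fit(documents):
    """Learn the vocabulary: map each distinct token to a column index."""
    tokens = sorted({token for document in documents
                     for token in document.lower().split()})
    return {token: column for column, token in enumerate(tokens)}


def transform(documents, vocabulary):
    """Count tokens: one row per document, one column per token."""
    matrix = []
    for document in documents:
        counts = Counter(document.lower().split())
        row = [0] * len(vocabulary)
        for token, count in counts.items():
            if token in vocabulary:
                row[vocabulary[token]] = count
        matrix.append(row)
    return matrix


documents = ["the cat sat", "the cat saw the dog"]
vocabulary = fit(documents)
document_term_matrix = transform(documents, vocabulary)
print(vocabulary)
print(document_term_matrix)
```

Calling fit and then transform on the same documents is what fit_transform does in one step.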

started = datetime.now()
vectorizer = CountVectorizer(stop_words="english")
document_term_matrix = vectorizer.fit_transform(newsgroups.data)
print("Elapsed: {}".format(datetime.now() - started))
Elapsed: 0:00:02.798860

That was pretty fast; I guess this data set is fairly small.
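Since the same start-and-print timing shows up around every step, it could be wrapped in a small context manager. This is just a sketch; timer is a hypothetical helper name, not something from the libraries used here.

```python
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def timer(label="Elapsed"):
    """Print the wall-clock time spent inside the with-block."""
    started = datetime.now()
    try:
        yield
    finally:
        print("{}: {}".format(label, datetime.now() - started))


# Any slow step can go inside the block; a sum stands in for it here.
with timer():
    total = sum(range(1000000))
```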


Now we'll build the Latent Dirichlet Allocation Model.

start = datetime.now()
topics = len(newsgroups.target_names)
lda = LatentDirichletAllocation(topics)
lda.fit(document_term_matrix)
print("Elapsed: {}".format(datetime.now() - start))
Elapsed: 0:02:30.557142


Okay so here's where we try and get pyLDAvis into this thing.

Prepare the Data for the Visualization

The Prepared Data

start = datetime.now()
prepared_data = pyLDAvis.sklearn.prepare(lda, document_term_matrix, vectorizer)
print("Elapsed: {}".format(datetime.now() - start))

Elapsed: 0:00:33.152028

Build the HTML

The HTML that creates the plot is fairly large. The browser seems to handle it okay, but Emacs gets noticeably slower. I'll try the simple template to see if that makes any difference (the default works in both Jupyter notebooks and any other HTML, but simple won't work in Jupyter notebooks). I'm also going to set the ID of the div, because the pyLDAvis CSS doesn't work so well with mine and I want to try to override the font-size on the header.

div_id = "pyldavis-in-org-mode"
html = pyLDAvis.prepared_data_to_html(prepared_data,
                                      template_type="simple",
                                      visid=div_id)

Embed the HTML

print('''#+BEGIN_EXPORT html
{0}
#+END_EXPORT'''.format(html, div_id))
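For org-mode to pass the plot through to the exported page, the HTML has to sit between the two export markers. Here's a minimal sketch of that wrapping; org_export_block is a hypothetical helper name, and the fragment is a placeholder rather than real pyLDAvis output.

```python
def org_export_block(html):
    """Wrap an HTML fragment in an org-mode HTML export block."""
    return "#+BEGIN_EXPORT html\n{}\n#+END_EXPORT".format(html.strip())


# Placeholder standing in for the output of prepared_data_to_html.
fragment = '<div id="pyldavis-in-org-mode"></div>'
block = org_export_block(fragment)
print(block)
```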