Introduction

In my last post I loaded the pyLDAvis widget by dumping the HTML/Javascript right into the org-mode document. The problem with doing this is that the document has a lot of lines of text in it, which slows down emacs a noticeable amount, making it hard to display one widget, and pretty much impractical to show more than one. So, since Nikola (or maybe bootstrap or one of the other plugins I'm using) is loading JQuery anyway, I'm going to use javascript to add the HTML after it loads from a file.

Imports

Python

datetime is just to show how long things take. In this case the data-set is fairly small so it doesn't take very long, but in other cases it might take a very long time to build the LDA model so I like to time it so I know the next time about how long I should wait.

from datetime import datetime
from pathlib import Path

From PyPi

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn

The Data

I'm going to use the Twenty Newsgroups data-set, not because of anything significant, but because sklearn has a downloader for it so I figured it'd be easiest.

path = Path("~/datasets/newsgroups/").expanduser()
newsgroups = fetch_20newsgroups(data_home=path, subset="train")

print(path)

/home/brunhilde/datasets/newsgroups

The newsgroups.data is a list, so it doesn't have a shape attribute like it would it it were a numpy array.

print("{:,}".format(len(newsgroups.data)))
print("{:.2f}".format(len(newsgroups.data)/18000))

11,314
0.63

The documentation for the fetch_20newsgroups function says that the full dataset has 18,000 entries, so we have about 63% of the full set.

The Vectorizer

I'm going to use sklearn's CountVectorizer to convert the newsgroups documents to arrays of token counts. This is about the visualization, not making an accurate model so I'm going to use the built-in tokenizer. I'm not sure what the fit method is for, but the fit_transform method returns the document-term matrix that we need (each row represents a document, the columns are the tokens, and the cells hold the counts for each token in the document).

started = datetime.now()
vectorizer = CountVectorizer(stop_words="english")
document_term_matrix = vectorizer.fit_transform(newsgroups.data)
print("Elapsed: {}".format(datetime.now() - started))

Elapsed: 0:00:03.033235

The LDA

Now we'll build the Latent Dirichlet Allocation Model.

start = datetime.now()
topics = len(newsgroups.target_names)
lda = LatentDirichletAllocation(topics)
lda.fit(document_term_matrix)
print("Elapsed: {}".format(datetime.now() - start))

Elapsed: 0:02:37.479097

PyLDAvis

Okay so here's where we try and get pyLDAvis into this thing.

Prepare the Data for the Visualization

The Prepared Data

The first step in using pyLDAvis is to create a PreparedData named-tuple using the prepare function.

start = datetime.now()
prepared_data = pyLDAvis.sklearn.prepare(lda, document_term_matrix, vectorizer)
print("Elapsed: {}".format(datetime.now() - start))

Elapsed: 0:00:34.293668

Build the HTML

Now we can create an HTML fragment using the prepared_data function. The output is a string of HTML script, style, and div tags. It adds the entire data-set as a javascript object so the more data you have, the longer the string will be.

div_id = "pyldavis-in-org-mode"
html = pyLDAvis.prepared_data_to_html(prepared_data,
                                      template_type="simple",
                                      visid=div_id)

Export the HTML

Now I'm going to save the html to a file so we can load it later.

slug = "pyldavis-in-org-mode-with-jquery"
posts = Path("../files/posts/")
folder = posts.joinpath(slug)
filename = "pyldavis_fragment.html"
if not folder.is_dir():
    folder.mkdir()

output = folder.joinpath(filename)
output.write_text(html)
assert output.is_file()

So here's where we create the HTML that will be embedded in this post. The JQuery load function puts the content of our saved file into the div. I added the css call because I have my site's font-size set to extra-large, since the Goudy Bookstyle looks too small to me otherwise (I think nice fonts look better when they're big), which causes the buttons in the pyLDAvis widget to overflow out of the header. Under normal circumstances you wouldn't need to do this, but if you do want to do any one-off styling, here's an example of how to do it. Otherwise maybe an update to the style-sheet would be better.

The right-hand box is still messed up, but it's good enough for this example.

print('''#+BEGIN_EXPORT html
<div id="{0}"></div>
<script>
$("#{0}").load("{1}")
$("#{0}-top").css("font-size", "large")
</script>
#+END_EXPORT'''.format(div_id, filename))

pyLDAvis In org-mode With JQuery

Table of Contents