pyLDAvis In org-mode With JQuery
Introduction
In my last post I loaded the pyLDAvis widget by dumping the HTML/Javascript right into the org-mode document. The problem with doing this is that it adds a lot of lines of text to the document, which slows emacs down noticeably, making it hard to work with one widget and pretty much impractical to show more than one. So, since Nikola (or maybe bootstrap or one of the other plugins I'm using) is loading JQuery anyway, I'm going to save the HTML to a file and use javascript to add it to the page once the page has loaded.
Imports
Python
datetime is just to show how long things take. In this case the data-set is fairly small so it doesn't take very long, but in other cases it might take a very long time to build the LDA model, so I like to time it so that next time I know about how long I should wait.
from datetime import datetime
from pathlib import Path
From PyPI
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn
The Data
I'm going to use the Twenty Newsgroups data-set, not because of anything significant, but because sklearn has a downloader for it so I figured it'd be easiest.
path = Path("~/datasets/newsgroups/").expanduser()
newsgroups = fetch_20newsgroups(data_home=path, subset="train")
print(path)
/home/brunhilde/datasets/newsgroups
The newsgroups.data is a list, so it doesn't have a shape attribute like it would if it were a numpy array.
print("{:,}".format(len(newsgroups.data)))
print("{:.2f}".format(len(newsgroups.data)/18000))
11,314
0.63
The documentation for the fetch_20newsgroups function says that the full dataset has around 18,000 entries, so the training subset we have is about 63% of the full set.
The Vectorizer
I'm going to use sklearn's CountVectorizer to convert the newsgroups documents to arrays of token counts. This is about the visualization, not making an accurate model, so I'm going to use the built-in tokenizer. The fit method only learns the vocabulary, while the fit_transform method learns the vocabulary and also returns the document-term matrix that we need (each row represents a document, the columns are the tokens, and the cells hold the counts for each token in the document).
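To make the shape of that matrix concrete, here is a plain-Python sketch of a document-term matrix for two made-up documents. It is not how CountVectorizer is implemented: it splits on whitespace only and skips the lower-casing and stop-word removal, and the names documents, vocabulary, and counts are just for illustration.

```python
from collections import Counter

# two made-up documents standing in for the newsgroup posts
documents = ["the cat sat", "the cat saw the dog"]

# roughly the "fit" step: learn the vocabulary, one column per distinct token
# (sorted here only to keep the column order stable)
vocabulary = sorted({token for document in documents
                     for token in document.split()})

# roughly the "transform" step: count every vocabulary token in every document
counts = [Counter(document.split()) for document in documents]
document_term_matrix = [[count[token] for token in vocabulary]
                        for count in counts]

print(vocabulary)            # ['cat', 'dog', 'sat', 'saw', 'the']
print(document_term_matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is a document and each column is a token, so the 2 in the second row says that "the" shows up twice in the second document.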
started = datetime.now()
vectorizer = CountVectorizer(stop_words="english")
document_term_matrix = vectorizer.fit_transform(newsgroups.data)
print("Elapsed: {}".format(datetime.now() - started))
Elapsed: 0:00:03.033235
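The start-then-print timing pattern gets repeated for every slow step in this post. One way to avoid retyping it, sketched here with a hypothetical helper named timed (it isn't part of the original code), is a small context manager built on the same datetime arithmetic.

```python
from contextlib import contextmanager
from datetime import datetime, timedelta


@contextmanager
def timed(label="Elapsed"):
    """Print how long the body of the ``with`` block took to run."""
    started = datetime.now()
    result = {}
    try:
        yield result
    finally:
        # store the timedelta so the caller can use it after the block
        result["elapsed"] = datetime.now() - started
        print("{}: {}".format(label, result["elapsed"]))


# stand-in for a slow step like fitting the vectorizer or the LDA model
with timed() as timing:
    total = sum(range(1000))
```

The printed line has the same "Elapsed: ..." format as the other outputs, and the elapsed time is still available afterwards as timing["elapsed"].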
The LDA
Now we'll build the Latent Dirichlet Allocation Model.
start = datetime.now()
topics = len(newsgroups.target_names)
lda = LatentDirichletAllocation(n_components=topics)
lda.fit(document_term_matrix)
print("Elapsed: {}".format(datetime.now() - start))
Elapsed: 0:02:37.479097
PyLDAvis
Okay so here's where we try and get pyLDAvis into this thing.
Prepare the Data for the Visualization
The Prepared Data
The first step in using pyLDAvis is to create a PreparedData named-tuple using the prepare function.
start = datetime.now()
prepared_data = pyLDAvis.sklearn.prepare(lda, document_term_matrix, vectorizer)
print("Elapsed: {}".format(datetime.now() - start))
Elapsed: 0:00:34.293668
Build the HTML
Now we can create an HTML fragment using the prepared_data_to_html function. The output is a string of HTML script, style, and div tags. It adds the entire data-set as a javascript object, so the more data you have, the longer the string will be.
div_id = "pyldavis-in-org-mode"
html = pyLDAvis.prepared_data_to_html(prepared_data,
                                      template_type="simple",
                                      visid=div_id)
Export the HTML
Now I'm going to save the html to a file so we can load it later.
slug = "pyldavis-in-org-mode-with-jquery"
posts = Path("../files/posts/")
folder = posts.joinpath(slug)
filename = "pyldavis_fragment.html"
if not folder.is_dir():
    folder.mkdir()
output = folder.joinpath(filename)
output.write_text(html)
assert output.is_file()
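The folder check above assumes that ../files/posts/ already exists. A slightly more forgiving variation, sketched here with a hypothetical helper named save_fragment and a throwaway temporary directory standing in for the posts folder, lets mkdir create any missing parent folders and skip the is_dir test.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def save_fragment(html, folder, filename="pyldavis_fragment.html"):
    """Write the HTML fragment into folder, creating the folder if needed."""
    folder = Path(folder)
    # parents=True creates missing intermediate folders;
    # exist_ok=True means an already existing folder isn't an error
    folder.mkdir(parents=True, exist_ok=True)
    output = folder.joinpath(filename)
    output.write_text(html)
    return output


# the temporary directory and "example-slug" are placeholders for this demo
with TemporaryDirectory() as scratch:
    saved = save_fragment("<div>placeholder</div>",
                          Path(scratch) / "posts" / "example-slug")
    saved_text = saved.read_text()
    saved_name = saved.name
```

In the post itself the call would take the real html string and the folder built from the slug.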
So here's where we create the HTML that will be embedded in this post. The JQuery load function puts the content of our saved file into the div. I added the css call because I have my site's font-size set to extra-large (the Goudy Bookstyle looks too small to me otherwise, and I think nice fonts look better when they're big), which causes the buttons in the pyLDAvis widget to overflow out of the header. Under normal circumstances you wouldn't need to do this, but if you do want to do any one-off styling, here's an example of how to do it. For anything more permanent, an update to the style-sheet would probably be better.
The right-hand box is still messed up, but it's good enough for this example.
print('''#+BEGIN_EXPORT html
<div id="{0}"></div>
<script>
// load() is asynchronous, so the styling goes in its completion callback
$("#{0}").load("{1}", function() {{
    $("#{0}-top").css("font-size", "large");
}});
</script>
#+END_EXPORT'''.format(div_id, filename))
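Since the point of loading from a file is to make more than one widget per post practical, the export block can be wrapped in a function. This is only a sketch: export_block, TEMPLATE, and the font_size parameter are names made up for illustration. Because JQuery's load is asynchronous, the sketch applies the styling in load's completion callback, on the assumption that the widget has been built by the time the callback runs. The doubled braces are how str.format is told to emit literal javascript braces.

```python
TEMPLATE = '''#+BEGIN_EXPORT html
<div id="{div_id}"></div>
<script>
$("#{div_id}").load("{filename}", function() {{
    $("#{div_id}-top").css("font-size", "{font_size}");
}});
</script>
#+END_EXPORT'''


def export_block(div_id, filename, font_size="large"):
    """Build the org-mode export block that loads one saved fragment."""
    return TEMPLATE.format(div_id=div_id,
                           filename=filename,
                           font_size=font_size)


block = export_block("pyldavis-in-org-mode", "pyldavis_fragment.html")
print(block)
```

A second widget would just be another call with its own div id and fragment file.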