# pyLDAvis in org-mode

## Imports

### Python

from datetime import datetime
from pathlib import Path


### From PyPi

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn


## The Data

path = Path("~/datasets/newsgroups/").expanduser()
newsgroups = fetch_20newsgroups(data_home=path, subset="train")

print(path)

/home/brunhilde/datasets/newsgroups



The newsgroups.data is a list, so it doesn't have a shape attribute like it would it it were a numpy array.

print("{:,}".format(len(newsgroups.data)))

11,314



The documentation for the fetch_20newsgroups function says that the full dataset has 18,000 entries, so we have about 63% of the full set.

## The Vectorizer

I'm going to use sklearn's CountVectorizer to convert the newsgroups to convert the documents to arrays of token counts. This is about the visualization, not making an accurate model so I'm going to use the built-in tokenizer. I'm not sure what the fit method is for, but the fit_transform method returns the document-term matrix that we need (each row represents a document, the columns are the tokens, and the cells hold the counts for each token in the document).

started = datetime.now()
vectorizer = CountVectorizer(stop_words="english")
document_term_matrix = vectorizer.fit_transform(newsgroups.data)
print("Elapsed: {}".format(datetime.now() - started))

Elapsed: 0:00:02.798860



That was pretty fast, I guess this data set is sort of small.

## The LDA

Now we'll build the Latent Dirichlet Allocation Model.

start = datetime.now()
topics = len(newsgroups.target_names)
lda = LatentDirichletAllocation(topics)
lda.fit(document_term_matrix)
print("Elapsed: {}".format(datetime.now() - start))

Elapsed: 0:02:30.557142



## PyLDAvis

Okay so here's where we try and get pyLDAvis into this thing.

### Prepare the Data for the Visualization

#### The Prepared Data

start = datetime.now()
prepared_data = pyLDAvis.sklearn.prepare(lda, document_term_matrix, vectorizer)
print("Elapsed: {}".format(datetime.now() - start))


Elapsed: 0:00:33.152028

#### Build the HTML

The HTML that creates the plot is fairly large. The browser seems to handle it okay, but emacs gets noticeably slower. I'll try the simple template to see if that makes any difference (the default works in both jupyter notebooks and any other HTML, but simple won't work in jupyter notebooks). I'm also going to set the ID because the CSS doesn't work so well with mine so I'm going to try and override the font-size on the header.

div_id = "pyldavis-in-org-mode"
html = pyLDAvis.prepared_data_to_html(prepared_data,
template_type="simple",
visid=div_id)


#### Embed the HTML

print('''#+BEGIN_EXPORT html
{}
<script>
document.querySelector("div#{}-top").style.fontSize="large"
</script>
#+END_EXPORT'''.format(html, div_id))