pyLDAvis in org-mode

Imports

Python

from datetime import datetime
from pathlib import Path

From PyPI

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn

The Data

path = Path("~/datasets/newsgroups/").expanduser()
newsgroups = fetch_20newsgroups(data_home=path, subset="train")
print(path)
/home/brunhilde/datasets/newsgroups

newsgroups.data is a list, so it doesn't have a shape attribute the way it would if it were a numpy array.

print("{:,}".format(len(newsgroups.data)))
11,314

The documentation for the fetch_20newsgroups function says that the full dataset has around 18,000 entries, so the training subset is about 63% of the full set.
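As a quick check of that percentage, here is the arithmetic spelled out. The 18,000 figure is the approximate total quoted from the documentation, not an exact count, so the result is only a rough estimate.

```python
train_count = 11_314        # len(newsgroups.data), from the output above
documented_total = 18_000   # approximate total quoted from the documentation

fraction = train_count / documented_total
print("{:.1%}".format(fraction))
# 62.9%
```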

The Vectorizer

I'm going to use sklearn's CountVectorizer to convert the documents to arrays of token counts. This is about the visualization, not about making an accurate model, so I'm going to use the built-in tokenizer. The fit method only learns the vocabulary, while the fit_transform method learns it and also returns the document-term matrix that we need (each row represents a document, each column a token, and each cell holds the count for that token in the document).

started = datetime.now()
vectorizer = CountVectorizer(stop_words="english")
document_term_matrix = vectorizer.fit_transform(newsgroups.data)
print("Elapsed: {}".format(datetime.now() - started))
Elapsed: 0:00:02.798860

That was pretty fast; I guess this data set is fairly small.
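To make the document-term matrix a little more concrete, here is a minimal sketch on a two-document toy corpus (the sentences are made up for illustration and are not part of the newsgroups data). It uses the default settings rather than stop_words="english" so that every token shows up in the counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = ["the cat sat", "the dog sat on the cat"]

# fit only learns the vocabulary (token -> column index)
learner = CountVectorizer()
learner.fit(docs)
print(sorted(learner.vocabulary_))
# ['cat', 'dog', 'on', 'sat', 'the']

# fit_transform learns the vocabulary and returns the sparse count matrix
matrix = CountVectorizer().fit_transform(docs)
print(matrix.toarray())
# [[1 0 0 1 1]
#  [1 1 1 1 2]]
```

The columns follow the sorted vocabulary, so the 2 in the last cell is the two occurrences of "the" in the second document.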

The LDA

Now we'll build the Latent Dirichlet Allocation Model.

start = datetime.now()
topics = len(newsgroups.target_names)
lda = LatentDirichletAllocation(n_components=topics)
lda.fit(document_term_matrix)
print("Elapsed: {}".format(datetime.now() - start))
Elapsed: 0:02:30.557142
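For a sense of what the fitted model holds, here is a small sketch on a made-up four-document corpus (the sentences, the two-topic setting, and the random_state are all assumptions for illustration). It only checks shapes: one row of token weights per topic, and one topic distribution per document.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "the goalie stopped the puck",
    "the pitcher threw the ball",
    "the rocket reached orbit",
    "the shuttle left orbit",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)

model = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = model.fit_transform(counts)

# components_ has one row per topic and one column per vocabulary token
print(model.components_.shape == (2, counts.shape[1]))

# each document's topic weights form a distribution that sums to 1
print(np.allclose(doc_topics.sum(axis=1), 1.0))
```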

PyLDAvis

Okay, so here's where we try to get pyLDAvis into this thing.

Prepare the Data for the Visualization

The Prepared Data

start = datetime.now()
prepared_data = pyLDAvis.sklearn.prepare(lda, document_term_matrix, vectorizer)
print("Elapsed: {}".format(datetime.now() - start))

Elapsed: 0:00:33.152028
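One caveat for anyone re-running this: as far as I know, pyLDAvis 3.4 and later renamed the pyLDAvis.sklearn module to pyLDAvis.lda_model. Below is a minimal sketch of an import that tolerates either name. The load_sklearn_adapter function is a hypothetical helper made up for this post, not part of pyLDAvis.

```python
import importlib


def load_sklearn_adapter():
    """Return whichever pyLDAvis scikit-learn module can be imported.

    Hypothetical helper: tries the newer module name first, then the
    older one, and returns None if neither is installed.
    """
    for name in ("pyLDAvis.lda_model", "pyLDAvis.sklearn"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


adapter = load_sklearn_adapter()
print(adapter.__name__ if adapter is not None else "pyLDAvis not available")
```

If it returns a module, adapter.prepare(lda, document_term_matrix, vectorizer) should be able to stand in for the pyLDAvis.sklearn.prepare call above.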

Build the HTML

The HTML that creates the plot is fairly large. The browser seems to handle it okay, but emacs gets noticeably slower. I'll try the simple template to see if that makes any difference (the default works in both jupyter notebooks and any other HTML, but simple won't work in jupyter notebooks). I'm also going to set the ID: the plot's CSS doesn't work well with my site's, so I'm going to try to override the font-size on the header.

div_id = "pyldavis-in-org-mode"
html = pyLDAvis.prepared_data_to_html(prepared_data,
                                      template_type="simple",
                                      visid=div_id)

Embed the HTML

print('''#+BEGIN_EXPORT html
{}
<script>
document.querySelector("div#{}-top").style.fontSize="large"
</script>
#+END_EXPORT'''.format(html, div_id))

Slip Box System Parts List

The Four Parts

A Capture System

This should be paper-based, or at least something that's always there and quick to use.

  • a notebook
  • index cards
  • loose paper
  • napkins…

A Reference System

This is where you put information about your sources. For books and papers Zotero is handy, although once again, the fact that I have to fire up this GUI-based program adds a little bit of overhead. The original system was just another box so I'm going to try something like that. Maybe a sub-folder…

The Slip Box

The original system was a wooden box with A6 paper. I'm using a static site with plain text (org-mode).

Something To Produce a Final Product

The system is aimed at writers, but I'm a computer programmer, and I think it might work with other types of output (like drawing), so this part is really just having some way to produce something from your project.

Reference

  • HTTSN - How To Take Smart Notes

Using Your Slip Box

Introduction

This is my re-wording of the Slip Box Method.

The Method

Capture Everything

Write everything down - ideas don't count until they're out of your head and on paper. Writing it down also frees your mind to move on to other things.

Take Notes

Whenever you are taking in someone else's ideas (e.g. reading, listening), take notes.

Make Your Notes Permanent

The initial notes are just temporary inputs. Later in the day you need to convert them into a form that has these attributes:

  • They are complete - write them for your future self; don't rely on being able to remember what else is needed to understand the note.
  • They are written in a way that relates to your interests.
  • There is only one idea per note.

Put the Permanent Notes in the Slip Box

  • When you file your note look through the other notes and try and place it behind a related note.
  • Add links to other notes that are related.
  • Add the new note to an entry-point note that holds links to other notes.

Work Bottom-Up

  • Don't try to come up with a project by "thinking"; look through the slip box and let it tell you what you're interested in.
  • If you have an idea for what to do but there isn't enough in the box yet, take more notes.
  • Keep the notes in one folder - don't sort them into sub-categories. This way you can make new associations that you didn't have when you first made the note.

Build Projects From Copies

Copy everything that seems relevant to a project folder on your desktop and see what needs to be filled in and what seems redundant (or maybe just wrong).

Translate Your Notes

Take these fragmented notes and convert them into a coherent argument. If you can't, take more notes to fill in what's missing.

Revise

Don't accept the first draft: edit, erase, redo.

Move On

When you're done with the project, start over with a new one.

Implementation Details

The original method used paper and a wooden box. I really like paper and am tempted to try this, but I don't think doing something that makes me even more of a pack rat is a good idea. The book (How To Take Smart Notes) recommends a computer program written specifically for this system, but I am a little leery of getting tied into one program, and all these GUI programs are starting to turn me off.

Instead, I'm going to try and use this blog as my slip-box, so, as far as "equipment" goes, this is what I'm going to use:

Flattening out the file-system makes it hard to browse the files, though. I guess less and ls are going to be the main things I use (and maybe ag and deft). We'll see. I only started reading the book yesterday, so I'm still trying to figure this out as I go.
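If the command-line tools turn out to be too clumsy, something like the following might help with browsing a flat folder. It's only a rough sketch of a case-insensitive search across one directory of notes. The find_notes name, the "*.org" pattern, and the folder path in the usage line are my own assumptions, not anything from the book.

```python
from pathlib import Path


def find_notes(folder, term, pattern="*.org"):
    """Return the sorted file names in a flat folder whose text mentions term.

    Hypothetical helper: the search is a plain case-insensitive substring
    match and does not look into sub-folders.
    """
    matches = []
    for note in sorted(Path(folder).expanduser().glob(pattern)):
        text = note.read_text(encoding="utf-8", errors="ignore")
        if term.lower() in text.lower():
            matches.append(note.name)
    return matches
```

Calling find_notes("~/notes", "luhmann") would list every org file in that (made-up) folder that mentions the term, which is roughly what ag does, minus the speed.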

Reference

  • HTTSN - How To Take Smart Notes

Bibliography: How To Take Smart Notes

Description

This book describes the Slip-Box method developed by Niklas Luhmann. Its focus is on research writing, but it seems like a good system for projects in general. It points out that people are generally taught to work in a series of "next-steps", but if you are doing something creative (or at least something you haven't done before) then this is an impractical, if not impossible, way to work. Instead the author proposes that you use a system of note-taking to capture everything and then look for patterns in your notes - a bottom-up approach rather than a top-down one.

Reference

[HTTSN] Ahrens S. How to take smart notes: one simple technique to boost writing, learning and thinking: for students, academics and nonfiction book writers. North Charleston, SC: CreateSpace; 2017. 170 p.