The Natural Language Toolkit

This is some local documentation for NLTK 3.3. The official documentation is pretty extensive, so I’m initially only going to document parts of the NLTK as I encounter them and maybe add some notes to help me remember things about them.

Data

The nltk.data module handles finding and loading NLTK resources: corpora, grammars, trained models, and so on.

Load

The nltk.data.load() function loads an NLTK resource for you, given its resource URL. It understands a range of formats, including pickled objects, grammars, and plain text.

load(resource_url[, format, cache, verbose, …]) Load a given resource from the NLTK data package.
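
For example, you can load a corpus file as plain text by giving an explicit format. This assumes the gutenberg corpus has already been downloaded with nltk.download("gutenberg"); otherwise the path won't resolve.

import nltk

# Load a raw corpus file as a decoded string instead of unpickling it.
emma = nltk.data.load("corpora/gutenberg/austen-emma.txt", format="text")
print(emma[:60])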

Tokenizers

Punkt

The PunktSentenceTokenizer splits a text into sentences using a model built by an unsupervised algorithm. Because that model has to be trained before the tokenizer can be used, you normally call nltk.data.load() to fetch a pre-trained model for the language you specify.

PunktSentenceTokenizer([train_text, …]) A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.
PunktSentenceTokenizer.tokenize(text[, …]) Given a text, returns a list of the sentences in that text.
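
If no pre-trained model exists for your language, you can train your own by passing raw text straight to the constructor; the algorithm is unsupervised, so the text needs no annotations. A minimal sketch, where my_corpus.txt is a hypothetical file of plain text in the target language:

from nltk.tokenize.punkt import PunktSentenceTokenizer

# Any reasonably large sample of unlabeled text in the target language
# will do; the constructor trains the model from it directly.
with open("my_corpus.txt", encoding="utf-8") as f:  # hypothetical file
    train_text = f.read()

tokenizer = PunktSentenceTokenizer(train_text)
sentences = tokenizer.tokenize("Some text in the same language.")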

To load an English-language tokenizer using nltk.data.load(), you pass it the path to the pickled model.

import nltk

tokenizer = nltk.data.load("tokenizers/punkt/PY3/english.pickle")

Then, to use it, you call its nltk.tokenize.punkt.PunktSentenceTokenizer.tokenize() method.

source = "Dr. Smith went to Washington. He gave a talk there."
sentences = tokenizer.tokenize(source)
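
The pre-trained English model knows that "Dr." is an abbreviation, so it should only split after "Washington." and print something like:

print(sentences)
# ['Dr. Smith went to Washington.', 'He gave a talk there.']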