The Natural Language Toolkit

This is some local documentation for NLTK 3.3. The official documentation is pretty extensive, so I’m initially only going to document parts of the NLTK as I encounter them and maybe add some notes to help me remember things about them.


This is the nltk.data module.


The nltk.data.load() function loads various NLTK resources for you.

load(resource_url[, format, cache, verbose, …]) Load a given resource from the NLTK data package.



The PunktSentenceTokenizer tokenizes a text into sentences using an unsupervised model. Because it needs to be trained before being used, you normally load it using the nltk.data.load() function, which loads a pre-trained model for the language you specify.

PunktSentenceTokenizer([train_text, …]) A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.
PunktSentenceTokenizer.tokenize(text[, …]) Given a text, returns a list of the sentences in that text.

To load an English-language tokenizer using nltk.data.load(), you pass it the path to the pickle file.

import nltk.data
tokenizer = nltk.data.load("tokenizers/punkt/PY3/english.pickle")

Then, to use it, you call its tokenize() method (nltk.tokenize.punkt.PunktSentenceTokenizer.tokenize()) with the text you want to split.

sentences = tokenizer.tokenize(source)
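
To try the tokenize() call without the NLTK data package installed, here is a minimal sketch. It assumes that constructing PunktSentenceTokenizer with no train_text is acceptable for a quick experiment: the tokenizer is then untrained and falls back on default parameters, so it is not a substitute for the pre-trained English model. The sample text is made up for illustration.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Assumption: no train_text is passed, so this tokenizer is untrained
# and knows no abbreviations or collocations.
tokenizer = PunktSentenceTokenizer()

# Hypothetical sample text, chosen to avoid abbreviations.
source = "Hello world. This is a test."

# tokenize() returns a list of sentence strings.
sentences = tokenizer.tokenize(source)

for sentence in sentences:
    print(sentence)
```

On text containing abbreviations (for example "Dr."), an untrained tokenizer may split in the wrong place, which is why the pre-trained model is normally loaded instead.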