This is some local documentation for NLTK 3.3. The official documentation is extensive, so for now I'm only documenting parts of NLTK as I encounter them, adding notes to help me remember how they work.
This is the nltk.data module. Its nltk.data.load() function loads various NLTK resources for you.
| Function | Description |
| --- | --- |
| load(resource_url[, format, cache, verbose, …]) | Load a given resource from the NLTK data package. |
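By default load() guesses the format from the file extension and caches the loaded object, but the format and cache parameters let you override that. A quick sketch (it reuses the Punkt pickle path from the tokenizer example further down, so it assumes that model has already been downloaded):

```python
import nltk.data

# Same resource as in the tokenizer example below, but returned as raw
# bytes instead of being unpickled, and without storing the result in
# the in-memory resource cache.
raw = nltk.data.load("tokenizers/punkt/PY3/english.pickle",
                     format="raw", cache=False)
```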
The PunktSentenceTokenizer tokenizes a text into sentences using an unsupervised model. Because it needs to be trained before it can be used, you normally load it with the nltk.data.load() function, which gives you a pre-trained model for the language you specify.
| Class / Method | Description |
| --- | --- |
| PunktSentenceTokenizer([train_text, …]) | A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. |
| PunktSentenceTokenizer.tokenize(text[, …]) | Given a text, returns a list of the sentences in that text. |
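The train_text parameter in the constructor is the unsupervised part: hand it a body of raw text in the target language and it builds the abbreviation, collocation, and sentence-starter model itself. A minimal sketch of that path (the training text here is just a stand-in, far too small for the learned statistics to be meaningful):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Stand-in training text; in practice you would train on a large corpus
# so that abbreviations like "Mr." are seen often enough to be learned.
train_text = (
    "Mr. Green met Dr. Brown on Monday. They discussed the report. "
    "Mr. Green agreed to send it to Dr. Brown by Friday."
)

trained_tokenizer = PunktSentenceTokenizer(train_text)
sentences = trained_tokenizer.tokenize("I saw Mr. Green today. He seemed well.")
```

For English you would normally skip training and load the pre-trained model instead, as shown below.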
To load an English-language tokenizer with the nltk.data.load() function, you pass it the path to the pickle file:
import nltk.data
tokenizer = nltk.data.load("tokenizers/punkt/PY3/english.pickle")
Then, to use it, you call its nltk.tokenize.punkt.PunktSentenceTokenizer.tokenize() method:
# "source" is the string you want to split into sentences.
sentences = tokenizer.tokenize(source)
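Putting both steps together, a minimal end-to-end sketch (the sample text is made up, and it assumes the punkt data has already been downloaded, e.g. with nltk.download("punkt")):

```python
import nltk.data

tokenizer = nltk.data.load("tokenizers/punkt/PY3/english.pickle")

source = (
    "Dr. Green went to the shop. He bought milk, eggs, etc. and then "
    "walked home."
)

# The pre-trained model knows abbreviations such as "Dr." and "etc.",
# so those periods should not be treated as sentence boundaries.
for sentence in tokenizer.tokenize(source):
    print(sentence)
```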