nltk.tokenize.punkt.PunktSentenceTokenizer¶

class nltk.tokenize.punkt.PunktSentenceTokenizer(train_text=None, verbose=False, lang_vars=<nltk.tokenize.punkt.PunktLanguageVars object>, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]¶

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.

__init__(train_text=None, verbose=False, lang_vars=<nltk.tokenize.punkt.PunktLanguageVars object>, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]¶: train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.

Methods

`__init__`([train_text, verbose, lang_vars, …])	train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.
`debug_decisions`(text)	Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.
`dump`(tokens)
`sentences_from_text`(text[, realign_boundaries])	Given a text, generates the sentences in that text by only testing candidate sentence breaks.
`sentences_from_text_legacy`(text)	Given a text, generates the sentences in that text.
`sentences_from_tokens`(tokens)	Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
`span_tokenize`(text[, realign_boundaries])	Given a text, generates (start, end) spans of sentences in the text.
`span_tokenize_sents`(strings)	Apply `self.span_tokenize()` to each element of `strings`.
`text_contains_sentbreak`(text)	Returns True if the given text includes a sentence break.
`tokenize`(text[, realign_boundaries])	Given a text, returns a list of the sentences in that text.
`tokenize_sents`(strings)	Apply `self.tokenize()` to each element of `strings`.
`train`(train_text[, verbose])	Derives parameters from a given training text, or uses the parameters given.

Attributes

PUNCTUATION

debug_decisions(text)[source]¶

Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.

See format_debug_decision() to help make this output readable.

sentences_from_text(text, realign_boundaries=True)[source]¶: Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, includes in the sentence closing punctuation that follows the period.

sentences_from_text_legacy(text)[source]¶: Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.

sentences_from_tokens(tokens)[source]¶: Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.

span_tokenize(text, realign_boundaries=True)[source]¶: Given a text, generates (start, end) spans of sentences in the text.

span_tokenize_sents(strings)¶

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]

Return type:	iter(list(tuple(int, int)))

text_contains_sentbreak(text)[source]¶: Returns True if the given text includes a sentence break.

tokenize(text, realign_boundaries=True)[source]¶: Given a text, returns a list of the sentences in that text.

tokenize_sents(strings)¶

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Return type:	list(list(str))

train(train_text, verbose=False)[source]¶: Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.