Sentences to Words

# from pypi
from nltk.tokenize import (
    TreebankWordTokenizer,
    )

Introduction

We're going to look at an NLTK tokenizer that takes a sentence and breaks it up into words. In the simplest case you could just split the sentence on whitespace, but the presence of punctuation makes the job a little harder.
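To see why splitting on whitespace alone isn't enough, here's a quick sketch using the same sentence we'll tokenize below:

```python
source = "I'm a man who can't say no."

# str.split breaks on whitespace only, so punctuation
# stays attached to the neighboring word.
tokens = source.split()
print(tokens)
# ['I'm', 'a', 'man', 'who', 'can't', 'say', 'no.']
```

The trailing period is stuck to 'no.', so a search for the word "no" in these tokens would miss it.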

TreebankWordTokenizer

The TreebankWordTokenizer follows the tokenization conventions of the Penn Treebank, a corpus built from Wall Street Journal articles.

source = "I'm a man who can't say no."
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(source))
['I', "'m", 'a', 'man', 'who', 'ca', "n't", 'say', 'no', '.']

Looking at the output, you can see that contractions are split apart. At first this seemed odd to me, but it makes sense once you realize that contractions are made up of multiple words. Still, the raw output seems hard to use as-is (how would you know that 'I' "'m" means 'I' 'am'?).
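One way to recover the underlying words is a lookup table mapping the contraction fragments back to their full forms. This is a minimal sketch with a hypothetical table (it's not part of NLTK), just to show the idea and its limits:

```python
# Hypothetical mapping from Treebank contraction fragments
# to full words -- an illustration, not an NLTK feature.
FRAGMENTS = {
    "'m": "am",
    "n't": "not",
    "'re": "are",
    "'ll": "will",
    "'ve": "have",
}

tokens = ['I', "'m", 'a', 'man', 'who', 'ca', "n't", 'say', 'no', '.']
expanded = [FRAGMENTS.get(token, token) for token in tokens]
print(expanded)
# ['I', 'am', 'a', 'man', 'who', 'ca', 'not', 'say', 'no', '.']
```

Note that this only fixes half the problem: "can't" was split into 'ca' and "n't", and while "n't" maps cleanly to 'not', the leftover 'ca' would still need to be repaired to 'can', which a fragment table alone can't do.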